# SCRAPY [DOCUMENTATION](https://docs.scrapy.org/en/latest/index.html)

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

# Basic commands

## Scrapy command line

|Scrapy command line|Description|
|-|-|
|**Help**||
|`scrapy -h`|(`--help`) list all the available commands (run from the project's directory)|
|`scrapy <command> -h`|help on the given command|
|**List info**||
|`scrapy list`|list all available spiders (run from the project's directory)|
|**Open a webpage in the shell**||
|`scrapy shell <url>`|fetch `<url>` and open an interactive shell with the `response` object loaded|
|(inside shell) `view(response)`|open the response object in your browser|
|**Project**||
|`scrapy startproject project_name [project_dir]`|create a new project|
|`scrapy genspider [-t template] <name> <domain or URL>`|create a new spider in the current folder or, when run inside a project, in the project's spiders folder|
|`scrapy crawl spider_name`|run the spider (from the project's top level dir)|

```
scrapy --help
Scrapy 2.11.0 - active project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
```

## Scrapy extraction most common tools

|Scrapy extraction tools|Description|
|-|-|
|`view(response)`|open the response page from the shell in your web browser|
|**Response status codes**||
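|`response.status`|HTTP status code of the response, e.g. `200`|
|**CSS selectors**||
|`response.css("title")`|select elements with a CSS query; returns a SelectorList|
|`response.css("title::text").getall()`|get the text of every match as a list of strings|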

# <b>1. Scrapy tutorial</b>

This tutorial will walk you through these tasks:

- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Exporting the scraped data using the command line
- Changing spider to recursively follow links
- Using spider arguments

# 1.0 Installation guide

We strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system packages.

```sh
(venv) $ pip install Scrapy
```

Scrapy is written in pure Python and depends on a few key Python packages (among others):

- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs

Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check [platform-specific guides](https://docs.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes).

In case of any trouble related to these dependencies, please refer to their respective installation instructions:

- [lxml installation](https://lxml.de/installation.html)
- [cryptography installation](https://cryptography.io/en/latest/installation/)

# 1.1 Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

```sh
scrapy startproject tutorial
```
```
New Scrapy project 'tutorial', using template directory '/home/commi/venv/venv3.11/lib/python3.11/site-packages/scrapy/templates/project', created in:
    /home/commi/Yandex.Disk/it_learning/08_web_scraping/02_scrapy/data/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
```

This will create a `tutorial` directory with the following contents:

```
tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py
```

# 1.2 Our first Spider

**Spiders** are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `Spider` and define the initial requests to make and, optionally, how to follow links in the pages and how to parse the downloaded page content to extract data.

This is the code for our first `Spider`. Save it in a file named `quotes_spider.py` under the `tutorial/spiders` directory in your project:

```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
```

As you can see, our Spider subclasses `scrapy.Spider` and defines some attributes and methods:

- `name`: identifies the Spider. It **must be unique within a project**, that is, you can’t set the same name for different Spiders.

- `start_requests()`: must return an iterable of `Request`s (you can return a **list** of requests or write a **generator** function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The `response` parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the `response`, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.
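To see how the `parse()` method of the spider above derives its file name, the URL handling can be tried on its own in plain Python, without running a crawl. This is only a sketch of that one expression; `page_filename` is a hypothetical helper name and not part of Scrapy.

```python
def page_filename(url: str) -> str:
    """Build the output file name the same way the spider's parse() does."""
    # Splitting ".../page/1/" on "/" ends with [..., "page", "1", ""]:
    # the trailing slash produces an empty last item, so index -2 is the page number.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("https://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("https://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Note that the `[-2]` index relies on the trailing slash; for a URL without one, the page number would sit at index `-1` instead.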

_ChatGPT:_  
In Scrapy, when you use the `yield` statement within a Spider callback method like `parse`, the yielded items are not stored directly. Instead, they are processed by the Scrapy framework, typically passed to [Item Pipeline](#2.7-Item-Pipeline) components.

Here's what happens when you yield items in Scrapy:
- When you `yield` an item from a Spider callback method like `parse`, Scrapy will send that item to the Item Pipeline.
- The Item Pipeline is a mechanism for processing the items scraped by the Spider. It allows you to perform various tasks on the scraped items, such as cleaning, validation, and persistence.
- Each item that is `yield`ed is processed through the Item Pipeline sequentially, allowing you to define various stages of processing.
- The Item Pipeline can perform operations like validation and transformation on the items before they are saved to a storage backend such as a database, JSON file, or CSV file.
- You can define your own custom Item Pipeline to process the scraped items according to your requirements.

So, in summary, the yielded data from a Scrapy Spider is not stored directly within the Spider itself; instead, it's passed through the Item Pipeline for further processing and eventual storage.
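As a rough illustration of the processing step described above, an item pipeline is a plain Python class with a `process_item(self, item, spider)` method that receives each yielded item and returns it, possibly modified. The class name `StripQuotesPipeline` and its cleaning rule are assumptions made for this sketch; in a real project the class would also have to be enabled through the `ITEM_PIPELINES` setting.

```python
class StripQuotesPipeline:
    """Hypothetical pipeline: removes the curly quotes around the 'text' field."""

    def process_item(self, item, spider):
        # Called once per yielded item; returning the item passes it on
        # to the next enabled pipeline.
        text = item.get("text")
        if text is not None:
            item["text"] = text.strip("“”")
        return item


# Exercising the class by hand, outside a crawl (the spider argument is unused here).
pipeline = StripQuotesPipeline()
item = {
    "text": "“A day without sunshine is like, you know, night.”",
    "author": "Steve Martin",
}
cleaned = pipeline.process_item(item, spider=None)
print(cleaned["text"])  # A day without sunshine is like, you know, night.
```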

## How to run our spider

To put our spider to work, go to the project’s top level directory and run:

```sh
scrapy crawl quotes
```
This command runs the spider named `quotes` that we’ve just added, which sends some requests to the `quotes.toscrape.com` domain. You will get an output similar to this:

```
2024-02-05 01:57:09 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)
2024-02-05 01:57:09 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
2024-02-05 01:57:09 [scrapy.addons] INFO: Enabled addons:
[]
2024-02-05 01:57:09 [asyncio] DEBUG: Using selector: EpollSelector
2024-02-05 01:57:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-05 01:57:09 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-02-05 01:57:09 [scrapy.extensions.telnet] INFO: Telnet Password: b9a7a03404bf04d7
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-02-05 01:57:09 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-05 01:57:09 [scrapy.core.engine] INFO: Spider opened
2024-02-05 01:57:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-02-05 01:57:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-05 01:57:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2024-02-05 01:57:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2024-02-05 01:57:10 [quotes] DEBUG: Saved file quotes-1.html
2024-02-05 01:57:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2024-02-05 01:57:10 [quotes] DEBUG: Saved file quotes-2.html
2024-02-05 01:57:10 [scrapy.core.engine] INFO: Closing spider (finished)
2024-02-05 01:57:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 684,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 25556,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.235283,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 2, 4, 20, 57, 10, 482312, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 8,
 'log_count/INFO': 10,
 'memusage/max': 65585152,
 'memusage/startup': 65585152,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 2, 4, 20, 57, 9, 247029, tzinfo=datetime.timezone.utc)}
2024-02-05 01:57:10 [scrapy.core.engine] INFO: Spider closed (finished)
```

Now, check the files in the current directory. You should notice that two new files have been created, `quotes-1.html` and `quotes-2.html`, with the content for the respective URLs, as our `parse` method instructs.

## What just happened under the hood?

Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the Spider. Upon receiving a `response` for each one, it instantiates `Response` objects and calls the callback method associated with the `request` (in this case, the `parse` method) passing the response as argument.
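The flow described above can be modelled in a few lines of plain Python. This is a deliberately simplified sketch, not Scrapy's actual engine: `ToyRequest`, `ToyResponse`, `fake_download` and `run` are hypothetical names, nothing is downloaded, and requests are handled one after another rather than asynchronously.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyResponse:
    url: str
    body: bytes


@dataclass
class ToyRequest:
    url: str
    callback: Callable[[ToyResponse], None]


def fake_download(request: ToyRequest) -> ToyResponse:
    # Stand-in for the downloader: fabricates a body instead of using the network.
    return ToyResponse(url=request.url, body=f"<html>{request.url}</html>".encode())


def run(start_requests) -> None:
    queue = deque(start_requests)          # "schedule" the initial requests
    while queue:
        request = queue.popleft()
        response = fake_download(request)  # build a response for the request
        request.callback(response)         # hand it to the request's callback


seen = []


def parse(response: ToyResponse) -> None:
    seen.append(response.url)


run(ToyRequest(url=u, callback=parse) for u in ["page/1/", "page/2/"])
print(seen)  # ['page/1/', 'page/2/']
```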

## A shortcut to the `start_requests` method

Instead of implementing a `start_requests()` method that generates `scrapy.Request` objects from URLs, you can just define a `start_urls` class attribute with a list of URLs. This list will then be used by the default implementation of `start_requests()` to create the initial requests for your spider.

```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
```

The `parse()` method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because `parse()` is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
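Conceptually, the default behaviour amounts to something like the following. This is a simplified stand-in written for illustration, not Scrapy's source code: `MiniSpider`, `MiniRequest` and `callback_for` are hypothetical names, and the real `Request` object carries much more than a URL and a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MiniRequest:
    url: str
    callback: Optional[Callable] = None  # None means "use the default callback"


class MiniSpider:
    start_urls: list = []

    def start_requests(self):
        # Default implementation: one request per entry in start_urls,
        # with no explicit callback attached.
        for url in self.start_urls:
            yield MiniRequest(url)

    def parse(self, response):
        raise NotImplementedError

    def callback_for(self, request):
        # Requests without a callback fall back to parse().
        return request.callback or self.parse


class QuotesMiniSpider(MiniSpider):
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        return f"parsed {response}"


spider = QuotesMiniSpider()
request = next(spider.start_requests())
print(spider.callback_for(request)("page 1"))  # parsed page 1
```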

## Extracting data

The best way to learn how to extract data with Scrapy is trying **selectors** using the Scrapy shell.

> Note: Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the `&` character) will not work.
>
> On Windows, use double quotes instead: `scrapy shell "https://quotes.toscrape.com/page/1/"`

Run:

```sh
scrapy shell 'https://quotes.toscrape.com/page/1/'
```
```
2024-02-06 14:47:14 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)
2024-02-06 14:47:14 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
2024-02-06 14:47:14 [scrapy.addons] INFO: Enabled addons:
[]
2024-02-06 14:47:14 [asyncio] DEBUG: Using selector: EpollSelector
2024-02-06 14:47:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-06 14:47:14 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-02-06 14:47:14 [scrapy.extensions.telnet] INFO: Telnet Password: 53d4e3939b5fb7e7
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2024-02-06 14:47:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-06 14:47:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-06 14:47:14 [scrapy.core.engine] INFO: Spider opened
2024-02-06 14:47:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2024-02-06 14:47:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f5b3ece0cd0>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7f5b3ffddc10>
[s]   spider     <DefaultSpider 'default' at 0x7f5b3e7fa950>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2024-02-06 14:47:16 [asyncio] DEBUG: Using selector: EpollSelector
```
```ipython
In [1]: 
```

Using the shell, you can try selecting elements using [CSS](https://www.w3.org/TR/selectors) with the `response` object:

```ipython
In [1]: response.css("title")
Out[1]: [<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

In [2]: response.status
Out[2]: 200
```

### `getall()` and `get()`

The result of running `response.css('title')` is a list-like object called **SelectorList**, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

```ipython
In [9]: response.css("title::text").getall()
Out[9]: ['Quotes to Scrape']
```

There are two things to note here. One is that we’ve added `::text` to the CSS query, which means we want to select only the text nodes directly inside the `<title>` element. If we don’t specify `::text`, we get the full title element, including its tags:

```ipython
In [11]: response.css("title").getall()
Out[11]: ['<title>Quotes to Scrape</title>']
```

The other thing is that the result of calling `.getall()` is a _list_: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

```ipython
In [12]: response.css("title::text").get()
Out[12]: 'Quotes to Scrape'
```

As an alternative, you could’ve written:

```ipython
In [16]: response.css("title::text")[0].get()
Out[16]: 'Quotes to Scrape'
```

Accessing an index on a SelectorList instance will raise an `IndexError` exception if there are no results. You might want to use `.get()` directly on the SelectorList instance instead, which returns `None` if there are no results:

```ipython
In [17]: response.css("noelement").get()
In [18]: response.css("noelement")[0].get()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
...
IndexError: list index out of range
```

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
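The two access styles behave like their plain-list counterparts, which the sketch below reproduces with an empty Python list standing in for an empty selection. `first` is a hypothetical helper that mimics what `.get()` does when nothing matched. In Scrapy itself, `.get()` also accepts a `default` argument, e.g. `response.css("noelement").get(default="n/a")`.

```python
def first(results, default=None):
    """Return the first element, or `default` when there are no results."""
    return results[0] if results else default


matches = []  # stands in for an empty selection: no element matched

print(first(matches))                 # None
print(first(matches, default="n/a"))  # n/a

try:
    matches[0]                        # direct indexing fails on empty results
except IndexError as exc:
    print(f"IndexError: {exc}")       # IndexError: list index out of range
```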

### `re()`

Besides the `getall()` and `get()` methods, you can also use the `re()` method to extract using [regular expressions](https://docs.python.org/3/library/re.html):

```ipython
In [20]: response.css("title::text").re(r".*uot.*")
Out[20]: ['Quotes to Scrape']

In [21]: response.css("title::text").re(r"Q\w+")
Out[21]: ['Quotes']

In [22]: response.css("title::text").re(r"(\w+) to (\w+)")
Out[22]: ['Quotes', 'Scrape']
```

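- `\w` matches any word character: a letter, a digit, or an underscore. For ASCII text this is equivalent to `[a-zA-Z0-9_]`; with Python `str` patterns it also matches non-ASCII letters and digits.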
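The same patterns can be tried with Python's standard `re` module, which helps when debugging an expression outside the Scrapy shell. One difference shows up with capture groups: `re.findall()` returns one tuple per match, whereas the `.re()` output above is a flat list of the captured strings. The flattening step below is written by hand to mirror that output; it is a sketch, not Scrapy's implementation.

```python
import re

title = "Quotes to Scrape"

print(re.findall(r".*uot.*", title))  # ['Quotes to Scrape']
print(re.findall(r"Q\w+", title))     # ['Quotes']

# With two capture groups, findall() yields tuples, one per match.
pairs = re.findall(r"(\w+) to (\w+)", title)
print(pairs)                          # [('Quotes', 'Scrape')]

# Flatten the tuples to get the same shape as the .re() result shown above.
flat = [group for match in pairs for group in match]
print(flat)                           # ['Quotes', 'Scrape']
```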

### `view(response)`

In order to find the proper CSS selectors to use, you might find it useful to open the `response` page from the shell in your web browser using `view(response)`. You can use your browser’s developer tools to inspect the HTML and come up with a selector (see [Using your browser’s Developer Tools for scraping](https://docs.scrapy.org/en/latest/topics/developer-tools.html#topics-developer-tools)).

[Selector Gadget](https://selectorgadget.com/) is also a nice tool for quickly finding the CSS selector for visually selected elements, and it works in many browsers.

### `XPath`: a brief intro

See [XPath](../XPath_tutorial.ipynb#XPath).

Besides CSS, Scrapy selectors also support using XPath expressions:

```ipython
In [24]: response.xpath("//title")
Out[24]: [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]

In [25]: response.xpath("//title/text()").get()
Out[25]: 'Quotes to Scrape'
```

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. [You can see that](#Extracting-data) if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like: _select the link that contains the text “Next Page”_. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors, because it will make scraping much easier.

We won’t cover much of XPath here, but you can read more about using [XPath with Scrapy Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors). To learn more about XPath, we recommend this [tutorial to learn XPath through examples](http://zvon.org/comp/r/tut-XPath_1.html), and this tutorial to learn [“how to think in XPath”](http://plasmasturm.org/log/xpath101/).
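To make the “Next Page” example above concrete without leaving the standard library, the sketch below uses `xml.etree.ElementTree`, which understands only a small XPath subset and needs well-formed markup. The pager fragment is made up for illustration. With Scrapy selectors the full XPath 1.0 language is available, so an expression such as `//a[contains(text(), "Next Page")]` could be passed to `response.xpath()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pager fragment (not copied from the real site).
fragment = """
<ul class="pager">
  <li><a href="/page/1/">Previous Page</a></li>
  <li><a href="/page/3/">Next Page</a></li>
</ul>
""".strip()

root = ET.fromstring(fragment)

# [.='Next Page'] selects <a> elements whose complete text equals "Next Page":
# the choice is based on content, not only on position in the tree.
next_links = root.findall(".//a[.='Next Page']")
print([a.get("href") for a in next_links])  # ['/page/3/']
```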

### Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in `https://quotes.toscrape.com` is represented by HTML elements that look like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```

Let’s open up `scrapy shell` and play a bit to find out how to extract the data we want. We get a list of selectors for the quote HTML elements with:

```sh
scrapy shell 'https://quotes.toscrape.com'
```
```ipython
In [1]: response.css("div.quote")
Out[1]: 
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]
```

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

```ipython
In [3]: quote = response.css("div.quote")[0]
```

Now, let’s extract `text`, `author` and the `tags` from that quote using the `quote` object we just created:

```ipython
In [4]: text = quote.css("span.text::text").get()
In [5]: text
Out[5]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
```

Given that the `tags` are a list of strings, we can use the `.getall()` method to get all of them:

```ipython
In [6]: tags = quote.css("div.tags a.tag::text").getall()
In [7]: tags
Out[7]: ['change', 'deep-thoughts', 'thinking', 'world']
```

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

```ipython
In [8]: for quote in response.css("div.quote"):
   ...:     text = quote.css("span.text::text").get()
   ...:     author = quote.css("small.author::text").get()
   ...:     tags = quote.css("div.tags a.tag::text").getall()
   ...:     print(dict(text=text, author=author, tags=tags))
   ...: 
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
```

### Extracting data in our spider

Let’s get back to our spider. So far, it doesn’t extract any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the `yield` Python keyword in the `callback`, as you can see below:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
```

To run this spider, exit the `scrapy shell` and run the crawler:

```sh
quit()
scrapy crawl quotes
```
```
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2024-02-06 23:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2024-02-06 23:53:53 [scrapy.core.engine] INFO: Closing spider (finished)
```

## Storing the scraped data

The simplest way to store the scraped data is by using [Feed exports](#2.8-Feed-exports), with the following command:

```sh
scrapy crawl quotes -O quotes.json
```

That will generate a `quotes.json` file containing all scraped items, serialized in JSON.

The `-O` command-line switch overwrites any existing file; use `-o` instead to append new content to any existing file. However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as `JSON Lines`:

```sh
scrapy crawl quotes -o quotes.jsonl
```

The [JSON Lines format](http://jsonlines.org/) is useful because it is stream-like: you can easily append new records to it, so it does not suffer from the problem JSON has when you run the crawl twice. Also, because each record is on a separate line, you can process big files without having to fit everything in memory; tools like `jq` help do that at the command line.
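
To make the difference concrete, here is a minimal sketch that uses only Python's standard library. It imitates two appending runs by writing the same records twice. The helper names (`append_json`, `append_jsonl`, `read_jsonl`) and the sample records are made up for illustration, and this is not how Scrapy's feed exporters are implemented.

```python
import json
import tempfile
from pathlib import Path

# Made-up sample records standing in for scraped items.
items = [
    {"text": "quote one", "author": "A"},
    {"text": "quote two", "author": "B"},
]


def append_json(path, records):
    """Append a whole JSON array to the file, as a second appending run would."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(records))


def append_jsonl(path, records):
    """Append one JSON document per line (JSON Lines)."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Yield records one at a time, without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


workdir = Path(tempfile.mkdtemp())
json_path = workdir / "quotes.json"
jsonl_path = workdir / "quotes.jsonl"

# Simulate running the crawl twice in append mode.
for _ in range(2):
    append_json(json_path, items)
    append_jsonl(jsonl_path, items)

# Two arrays written back to back are not a single valid JSON document.
try:
    json.loads(json_path.read_text(encoding="utf-8"))
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# The JSON Lines file still parses, record by record.
jsonl_records = list(read_jsonl(jsonl_path))
print(json_is_valid, len(jsonl_records))
```

Because `read_jsonl` is a generator, a consumer can process each record and discard it, which is what makes the format practical for large files.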

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an [Item Pipeline](#2.7-Item-Pipeline). A placeholder file for Item Pipelines has been set up for you when the project is created, in `tutorial/pipelines.py`. You don’t need to implement any item pipelines, however, if you just want to store the scraped items.

```sh
$ scrapy crawl -h
```
```
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
  -h, --help            show this help message and exit
  -a NAME=VALUE         set spider argument (may be repeated)
  -o FILE, --output FILE
                        append scraped items to the end of FILE (use - for stdout), to define format set a colon at the end of the output
                        URI (i.e. -o FILE:FORMAT)
  -O FILE, --overwrite-output FILE
                        dump scraped items into FILE, overwriting any existing file, to define format set a colon at the end of the
                        output URI (i.e. -O FILE:FORMAT)
  -t FORMAT, --output-format FORMAT
                        format to use for dumping items

Global Options
--------------
  --logfile FILE        log file. if omitted stderr will be used
  -L LEVEL, --loglevel LEVEL
                        log level (default: DEBUG)
  --nolog               disable logging completely
  --profile FILE        write python cProfile stats to FILE
  --pidfile FILE        write process ID to FILE
  -s NAME=VALUE, --set NAME=VALUE
                        set/override setting (may be repeated)
  --pdb                 enable pdb on failure
```


## Following links

Let’s say, instead of just scraping the stuff from the first two pages from https://quotes.toscrape.com, you want quotes from all the pages in the website.

Now that you know how to extract data from pages, let’s see how to follow links from them.

First thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

```html
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>
```

We can try extracting it in the shell:

```ipython
In [1]: response.css("li.next a")
Out[1]: [<Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' next ')]/descendant-or-self::*/a" data='<a href="/page/2/">Next <span aria-hi...'>]

In [2]: response.css("li.next a").get()
Out[2]: '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
```

This gets the anchor element, but we want the attribute `href`. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

```ipython
In [3]: response.css("li.next a::attr(href)").get()
Out[3]: '/page/2/'
```
There is also an `attrib` property available (see [Selecting element attributes](#Selecting-element-attributes) for more):

```ipython
In [4]: response.css("li.next a").attrib["href"]
Out[4]: '/page/2/'
```

Let’s see now our spider modified to recursively follow the link to the next page, extracting data from it:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

Now, after extracting the data, the `parse()` method looks for the link to the next page, builds a full absolute URL using the `urljoin()` method (since the links can be relative) and `yield`s a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
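
The URL resolution step can be tried outside Scrapy. The sketch below uses the standard library's `urllib.parse.urljoin` to show how relative links resolve against a page URL; it assumes the page URL is the base, and the `../2/` and `example.com` links are invented for illustration.

```python
from urllib.parse import urljoin

page_url = "https://quotes.toscrape.com/page/1/"

# A root-relative link, like the href extracted from "li.next a".
next_url = urljoin(page_url, "/page/2/")

# A path-relative link resolves against the directory of the current page.
sibling_url = urljoin(page_url, "../2/")

# An absolute link is returned unchanged.
absolute_url = urljoin(page_url, "https://example.com/other/")

print(next_url)
print(sibling_url)
print(absolute_url)
```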

What you see here is Scrapy’s mechanism of following links: when you `yield` a `Request` in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with **pagination**.
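
The loop described above can be modelled in a few lines of plain Python. This is a toy model under stated assumptions, not Scrapy's engine: `FAKE_SITE`, `run`, and the `("request", url, callback)` tuple are hypothetical stand-ins for real pages, the scheduler, and `Request` objects.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (quotes on the page, next-page URL or None).
FAKE_SITE = {
    "/page/1/": (["q1", "q2"], "/page/2/"),
    "/page/2/": (["q3"], "/page/3/"),
    "/page/3/": (["q4", "q5"], None),
}


def parse(url):
    """Callback: yield one item per quote, then a request for the next page."""
    quotes, next_page = FAKE_SITE[url]
    for quote in quotes:
        yield {"text": quote}
    if next_page is not None:
        # Stand-in for yielding a Request with callback=self.parse.
        yield ("request", next_page, parse)


def run(start_url):
    """Toy scheduler: run callbacks until no pending requests remain."""
    queue = deque([(start_url, parse)])
    items, visited = [], []
    while queue:
        url, callback = queue.popleft()
        visited.append(url)
        for result in callback(url):
            if isinstance(result, tuple) and result[0] == "request":
                queue.append((result[1], result[2]))
            else:
                items.append(result)
    return items, visited


items, visited = run("/page/1/")
print(visited)
print(len(items))
```

The crawl stops on its own once a page yields no further request, which is the same termination condition the spider relies on.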

### A shortcut for creating Requests

As a shortcut for creating `Request` objects you can use `response.follow`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Unlike `scrapy.Request`, `response.follow` supports relative URLs directly - no need to call `urljoin`. Note that `response.follow` just returns a `Request` instance; you still have to `yield` this `Request`.

You can also pass a selector to `response.follow` instead of a string; this selector should extract necessary attributes:

```python
for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)
```

For `<a>` elements there is a shortcut: `response.follow` uses their `href` attribute automatically. So the code can be shortened further:

```python
for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)
```

To create multiple requests from an iterable, you can use `response.follow_all` instead:

```python
anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)
```

or, shortening it further:

```python
yield from response.follow_all(css="ul.pager a", callback=self.parse)
```

## More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }
```

This spider starts from the main page and follows all the links to the author pages, calling the `parse_author` callback for each of them; it also follows the pagination links with the `parse` callback, as we saw before.

Here we’re passing callbacks to `response.follow_all` as positional arguments to make the code shorter; it also works for `Request`.

The `parse_author` callback defines a helper function to extract and cleanup the data from a CSS query and `yield`s the Python `dict` with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting `DUPEFILTER_CLASS`.
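
The idea behind duplicate filtering can be sketched with a set of already-seen URLs. This is a deliberately simplified illustration: the `SeenUrlFilter` class and the author links are hypothetical, and Scrapy's default filter is more elaborate than a plain URL comparison.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers every URL it has been asked about."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, url):
        """Return True if the URL was already seen, otherwise record it."""
        if url in self._seen:
            return True
        self._seen.add(url)
        return False


# Several quotes link to the same author page.
author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",
    "/author/Jane-Austen",
    "/author/Albert-Einstein",
]

dupefilter = SeenUrlFilter()
scheduled = [url for url in author_links if not dupefilter.request_seen(url)]
print(scheduled)
```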

Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the `CrawlSpider` class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.

Also, a common pattern is to build an item with data from more than one page, using a trick to [pass additional data to the callbacks](#Passing-additional-data-to-callback-functions).

# 1.3 Using spider arguments

You can provide command line arguments to your spiders by using the `-a` option when running them:

```sh
scrapy crawl quotes -O quotes-humor.json -a tag=humor
```

These arguments are passed to the Spider’s `__init__` method and become spider attributes by default.

In this example, the value provided for the `tag` argument will be available via `self.tag`. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url += "tag/" + tag

        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

If you pass the `tag=humor` argument to this spider, you’ll notice that it will only visit URLs from the humor tag, such as `https://quotes.toscrape.com/tag/humor`.
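
The URL-building logic can be checked without running a crawl. The class below is a hypothetical stand-in for a spider, not `scrapy.Spider`: it assumes that keyword arguments are simply copied onto the instance, mirroring the "become spider attributes by default" behaviour described above.

```python
class ArgSpider:
    """Hypothetical stand-in for a spider that receives -a NAME=VALUE arguments."""

    def __init__(self, **kwargs):
        # Assumption: each argument becomes an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        """Build the start URL, narrowing to a tag page when `tag` was given."""
        url = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url += "tag/" + tag
        return url


with_tag = ArgSpider(tag="humor").start_url()
without_tag = ArgSpider().start_url()
print(with_tag)
print(without_tag)
```

Using `getattr` with a default keeps the spider usable when no `-a tag=...` argument is passed.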

You can learn more about [handling spider arguments here](#Spider-arguments).

# <b>2. Basic concepts</b>

# 2.1 Command line tool

Scrapy is controlled through the scrapy command-line tool, to be referred here as the **“Scrapy tool”** to differentiate it from the sub-commands, which we just call **“commands”** or **“Scrapy commands”**.

The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.

(The `scrapy deploy` command was removed in 1.0 in favor of the standalone `scrapyd-deploy`. See [Deploying your project](https://scrapyd.readthedocs.io/en/latest/deploy.html).)

## Configuration settings

Scrapy will look for configuration parameters in ini-style `scrapy.cfg` files in standard locations:
- `/etc/scrapy.cfg` or `c:\scrapy\scrapy.cfg` (system-wide),
- `~/.config/scrapy.cfg` (`$XDG_CONFIG_HOME`) and `~/.scrapy.cfg` (`$HOME`) for global (user-wide) settings, and
- `scrapy.cfg` inside a Scrapy project’s root (see next section).

Settings from these files are merged in the listed order of preference: 
- user-defined values have higher priority than system-wide defaults and 
- project-wide settings will override all others, when defined.
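
The merge order can be illustrated with the standard library's `configparser`, where values read later override values read earlier. The three file contents and the `merge_configs` helper below are hypothetical; the sketch shows only the precedence rule, not how Scrapy locates the files.

```python
import configparser

# Hypothetical file contents, listed from lowest to highest priority.
SYSTEM_CFG = """
[settings]
default = systemwide.settings

[deploy]
url = http://system.example/
"""

USER_CFG = """
[settings]
default = user.settings
"""

PROJECT_CFG = """
[settings]
default = myproject.settings
"""


def merge_configs(*sources):
    """Read each source in order; later sources override earlier ones."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


merged = merge_configs(SYSTEM_CFG, USER_CFG, PROJECT_CFG)
settings_module = merged.get("settings", "default")
deploy_url = merged.get("deploy", "url")
print(settings_module)
print(deploy_url)
```

Keys defined only at a lower level, such as the `[deploy]` URL here, survive the merge because nothing at a higher level overrides them.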

Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:
- SCRAPY_SETTINGS_MODULE (see [Designating the settings](#Designating-the-settings))
- SCRAPY_PROJECT (see [Sharing the root directory between projects](#Sharing-the-root-directory-between-projects))
- SCRAPY_PYTHON_SHELL (see [Scrapy shell](#2.3-Scrapy-shell))

_ChatGPT:_  
Ini-style, short for "Initialization style," refers to a simple text-based file format used for configuration or initialization files in computing. It is named after the ".ini" file extension commonly associated with these types of files in Windows environments.

Ini-style files consist of sections, each containing key-value pairs, typically used to represent configuration settings for applications or systems. The structure is straightforward, with sections enclosed in square brackets "`[ ]`" and key-value pairs separated by an equals sign "`=`" or a colon "`:`".
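
As a small illustration of that syntax, the sketch below parses an ini-style string with Python's `configparser`, which accepts both `=` and `:` as separators by default. The file contents are invented for the example.

```python
import configparser

# Invented ini-style text: one pair uses "=", the other uses ":".
INI_TEXT = """
[settings]
default = myproject.settings

[deploy]
project: myproject
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

sections = parser.sections()
settings_default = parser["settings"]["default"]
deploy_project = parser["deploy"]["project"]
print(sections, settings_default, deploy_project)
```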

### Default structure of Scrapy projects

Before delving into the command-line tool and its sub-commands, let’s first understand the directory structure of a Scrapy project.

Though it can be modified, all Scrapy projects have the same file structure by default, similar to this:

```
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
```

The directory where the `scrapy.cfg` file resides is known as the **project root directory**. That file contains the name of the python module that defines the project settings. Here is an example:

```
[settings]
default = myproject.settings
```


In [5]:
tree data/tutorial

data/tutorial
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── __pycache__
    │   ├── __init__.cpython-311.pyc
    │   └── settings.cpython-311.pyc
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-311.pyc
        │   ├── quotes_spider.cpython-311.pyc
        │   └── tmp.cpython-311.pyc
        ├── quotes_spider.py
        └── tmp.py

5 directories, 14 files


### Sharing the root directory between projects

A project root directory, the one that contains the `scrapy.cfg`, may be shared by multiple Scrapy projects, each with its own settings module.

In that case, you must define one or more aliases for those settings modules under `[settings]` in your `scrapy.cfg` file:

```
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
```

In [6]:
cat data/tutorial/scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial


By default, the `scrapy` command-line tool will use the `default` settings. Use the `SCRAPY_PROJECT` environment variable to specify a different project for `scrapy` to use:

```sh
$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
```
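
The alias lookup can be sketched as follows. This is an illustration under stated assumptions rather than Scrapy's own code: `settings_module_for` is a hypothetical helper, and the environment is passed in as a plain dictionary so the example does not touch the real process environment.

```python
import configparser

SCRAPY_CFG = """
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
"""


def settings_module_for(cfg_text, environ):
    """Pick the settings module for the alias named by SCRAPY_PROJECT."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the "default" alias when the variable is not set.
    alias = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", alias)


default_module = settings_module_for(SCRAPY_CFG, {})
project2_module = settings_module_for(SCRAPY_CFG, {"SCRAPY_PROJECT": "project2"})
print(default_module)
print(project2_module)
```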

## Using the `scrapy` tool

### General info and help

You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands:

```sh
Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]
```

The first line will print the currently active project if you’re inside a Scrapy project. In this example it was run from outside a project. If run from inside a project it would have printed something like this:

```sh
Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]

[...]
```

### Creating projects

The first thing you typically do with the `scrapy` tool is create your Scrapy project:

```sh
scrapy startproject myproject [project_dir]
```

That will create a Scrapy project under the `project_dir` directory. If `project_dir` wasn’t specified, `project_dir` will be the same as `myproject`.

Next, you go inside the new project directory:

``` sh
cd project_dir
```

And you’re ready to use the `scrapy` command to manage and control your project from there.

### Controlling projects

You use the `scrapy` tool _from inside_ your projects to control and manage them.

For example, to create a new spider:

```
scrapy genspider mydomain mydomain.com
```

Some Scrapy commands (like `crawl`) must be run from inside a Scrapy project. See the [commands reference](#Available-tool-commands) below for more information on which commands must be run from inside projects and which do not.

Also keep in mind that some commands may have slightly different behaviours when running them from inside projects. For example, the `fetch` command will use spider-overridden behaviours (such as the `user_agent` attribute to override the user-agent) if the url being fetched is associated with some specific spider. This is intentional, as the `fetch` command is meant to be used to check how spiders are downloading pages.

## Available tool commands

This section contains a list of the available built-in commands with a description and some usage examples. Remember, you can always get more info about each command by running:

```sh
scrapy <command> -h
```

And you can see all available commands with:

```sh
scrapy -h
```

There are two kinds of commands, 
- those that only work from inside a `Scrapy` project (**Project-specific commands**) and 
- those that also work without an active Scrapy project (**Global commands**), 

though they may behave slightly differently when run from inside a project (as they would use the project's overridden settings).

### Global commands

- `startproject`
- `genspider`
- `settings`
- `runspider`
- `shell`
- `fetch`
- `view`
- `version`
- `bench`

#### `startproject`

Syntax: `scrapy startproject <project_name> [project_dir]`

Requires project: no

Creates a new Scrapy project named `project_name`, under the `project_dir` directory. If `project_dir` wasn’t specified, `project_dir` will be the same as `project_name`.

Usage example:
```sh
$ scrapy startproject myproject
```

#### `genspider`

Syntax: `scrapy genspider [-t template] <name> <domain or URL>`

Requires project: no

New in version 2.6.0: The ability to pass a URL instead of a domain.

Create a new spider in the current folder or in the current project’s spiders folder, if called from inside a project. The `<name>` parameter is set as the spider’s name, while `<domain or URL>` is used to generate the `allowed_domains` and `start_urls` spider’s attributes.
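
One way to picture how a single `<domain or URL>` argument could yield both attributes is the sketch below. The helper names (`extract_domain`, `ensure_scheme`, `spider_attributes`) are hypothetical, and defaulting to `https` when no scheme is given is an assumption of this example; check the generated spider file for what your Scrapy version actually produces.

```python
from urllib.parse import urlparse


def extract_domain(value):
    """Return the host part of a URL or of a bare domain."""
    parsed = urlparse(value)
    if parsed.scheme == "" and parsed.netloc == "":
        # A bare domain parses as a path; prefix "//" so it parses as a host.
        parsed = urlparse("//" + value.lstrip("/"))
    return parsed.netloc


def ensure_scheme(value):
    """Return a full URL, adding https:// to a bare domain (an assumption)."""
    parsed = urlparse(value)
    if parsed.scheme == "" and parsed.netloc == "":
        parsed = urlparse("//" + value)._replace(scheme="https")
    return parsed.geturl()


def spider_attributes(domain_or_url):
    """Derive illustrative allowed_domains and start_urls values."""
    return {
        "allowed_domains": [extract_domain(domain_or_url)],
        "start_urls": [ensure_scheme(domain_or_url)],
    }


from_domain = spider_attributes("example.com")
from_url = spider_attributes("https://quotes.toscrape.com/page/1/")
print(from_domain)
print(from_url)
```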

Usage example:

```sh
$ scrapy genspider -l
```
```
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```
```sh
$ scrapy genspider example example.com
```
```
Created spider 'example' using template 'basic'
```
```sh
$ scrapy genspider -t crawl scrapyorg scrapy.org
```
```
Created spider 'scrapyorg' using template 'crawl'
```

This is just a convenience shortcut command for creating spiders based on pre-defined templates, but certainly not the only way to create spiders. 

> You can just create the spider source code files yourself, instead of using this command.

#### `fetch`

Syntax: `scrapy fetch <url>`

Requires project: no

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a `user_agent` attribute that overrides the User-Agent, it will use that one.

So this command can be used to “see” how your spider would fetch a certain page.

If used outside a project, no particular per-spider behaviour would be applied and it will just use the default Scrapy downloader settings.

Supported options:
- `--spider SPIDER`: bypass spider autodetection and force use of specific spider
- `--headers`: print the response’s HTTP headers instead of the response’s body
- `--no-redirect`: do not follow HTTP 3xx redirects (default is to follow them)

Usage examples:

```sh
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
```

#### `shell`

Syntax: `scrapy shell [url]`

Requires project: no

Starts the Scrapy shell for the given URL (if given) or empty if no URL is given. Also supports UNIX-style local file paths, either relative with `./` or `../` prefixes or absolute file paths. See [Scrapy shell](#2.6-Scrapy-shell) for more info.

Supported options:
- `--spider=SPIDER`: bypass spider autodetection and force use of specific spider
- `-c code`: evaluate the code in the shell, print the result and exit
- `--no-redirect`: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.

Usage example:

```sh
$ scrapy shell http://www.example.com/some/page.html
```
```
[ ... scrapy shell starts ... ]
```
```sh
$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
```
```
(200, 'http://www.example.com/')
```
```sh
# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
```
```
(200, 'http://example.com/')
```
```sh
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
```
```
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
```

#### `view`

Syntax: `scrapy view <url>`

Requires project: no

Opens the given URL in a browser, as your Scrapy spider would “see” it. Sometimes spiders see pages differently from regular users, so this can be used to check what the spider “sees” and confirm it’s what you expect.

Supported options:
- `--spider SPIDER`: bypass spider autodetection and force use of specific spider
- `--no-redirect`: do not follow HTTP 3xx redirects (default is to follow them)

Usage example:
```sh
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
```

#### `settings`

Syntax: `scrapy settings [options]`

Requires project: no

Get the value of a Scrapy setting.

If used inside a project it’ll show the project setting value, otherwise it’ll show the default Scrapy value for that setting.

Example usage:
```sh
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
```

#### `runspider`

Syntax: `scrapy runspider <spider_file.py>`

Requires project: no

Run a spider self-contained in a Python file, without having to create a project.

Example usage:
```sh
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
```

#### `version`

Syntax: `scrapy version [-v]`

Requires project: no

Prints the Scrapy version. If used with `-v` it also prints Python, Twisted and Platform info, which is useful for bug reports.

#### `bench`

Syntax: `scrapy bench`

Requires project: no

Run a quick benchmark test. See [Benchmarking](#4.12-Benchmarking).

### Project-only commands

- `crawl`
- `check`
- `list`
- `edit`
- `parse`

#### `crawl`

Syntax: `scrapy crawl <spider>`

Requires project: yes

Start crawling using a spider.

Supported options:
- `-h`, `--help`: show a help message and exit
- `-a NAME=VALUE`: set a spider argument (may be repeated)
- `--output FILE` or `-o FILE`: append scraped items to the end of `FILE` (use - for stdout), to define format set a colon at the end of the output URI (i.e. `-o FILE:FORMAT`)
- `--overwrite-output FILE` or `-O FILE`: dump scraped items into `FILE`, overwriting any existing file, to define format set a colon at the end of the output URI (i.e. `-O FILE:FORMAT`)
- `--output-format FORMAT` or `-t FORMAT`: deprecated way to define format to use for dumping items, does not work in combination with `-O`

Usage examples:
```sh
$ scrapy crawl myspider
```
```
[ ... myspider starts crawling ... ]
```
```sh
$ scrapy crawl -o myfile:csv myspider
```
```
[ ... myspider starts crawling and appends the result to the file myfile in csv format ... ]
```
```sh
$ scrapy crawl -O myfile:json myspider
```
```
[ ... myspider starts crawling and saves the result in myfile in json format overwriting the original content... ]
```
```sh
$ scrapy crawl -o myfile -t csv myspider
```
```
[ ... myspider starts crawling and appends the result to the file myfile in csv format ... ]
```

#### `check`

Syntax: `scrapy check [options] <spider>`

Requires project: yes

Check spider contracts.

Options:

```
  -h, --help            show this help message and exit
  -l, --list            only list contracts, without checking them
  -v, --verbose         print contract tests for all spiders

Global Options
--------------
  --logfile FILE        log file. if omitted stderr will be used
  -L LEVEL, --loglevel LEVEL
                        log level (default: DEBUG)
  --nolog               disable logging completely
  --profile FILE        write python cProfile stats to FILE
  --pidfile FILE        write process ID to FILE
  -s NAME=VALUE, --set NAME=VALUE
                        set/override setting (may be repeated)
  --pdb                 enable pdb on failure
```

Usage examples:

```sh
$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
```

##### Spider contracts

_ChatGPT:_  
In Scrapy, spider contracts are a lightweight way to test spider callbacks. Each callback declares its contracts in its docstring, as lines starting with `@`: a sample URL to fetch and constraints on what the callback should return for it. The built-in contracts (`@url`, `@cb_kwargs`, `@returns`, `@scrapes`) live in the `scrapy.contracts.default` module, and the `scrapy check` command runs them.

Spider contracts can be particularly useful for ensuring data quality and consistency, especially when you're dealing with large-scale web scraping projects where data structure or content might vary across different pages or domains.

Here's a brief overview of how spider contracts work in Scrapy:

- **Defining Contracts:** Custom contracts are Python classes that subclass `scrapy.contracts.Contract` and set a `name`. They can override three methods: `adjust_request_args(args)` to modify the arguments of the sample request, `pre_process(response)` to check the response before it reaches the callback, and `post_process(output)` to check what the callback returned.

- **Implementing Verification Logic:** Within `pre_process` or `post_process`, you inspect the response or the scraped data and raise `scrapy.exceptions.ContractFail` if the criteria are not met.

- **Enabling Contracts:** Custom contracts are registered in the `SPIDER_CONTRACTS` setting. A contract applies to a callback when the callback's docstring contains a line of the form `@<contract name> [arguments]`.

- **Running Contract Checks:** Contracts are not evaluated during a normal crawl. Running `scrapy check` fetches each callback's sample URL, calls the callback and applies its contracts; any contract that fails is reported as `[FAILED]`, as in the output above.

By using spider contracts, you can ensure that your spiders are producing reliable and consistent output, which can be essential for downstream processing and analysis of the scraped data.

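Here's a simple example of a custom contract class that checks that every item returned by a callback contains certain fields (the built-in `@scrapes` contract performs a similar check):

```python
from itemadapter import ItemAdapter, is_item
from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class RequiredFieldsContract(Contract):
    """Contract to verify the presence of required fields in callback output"""

    # Referenced in callback docstrings as @required_fields
    name = 'required_fields'

    def post_process(self, output):
        # `output` is the list of objects (items and requests) returned by the callback
        required_fields = ['title', 'url', 'content']
        for item in output:
            if not is_item(item):
                continue  # skip Request objects
            for field in required_fields:
                if field not in ItemAdapter(item):
                    raise ContractFail(f"Required field '{field}' missing in item: {item}")
```

And here's how you would enable this contract. First register it in the `SPIDER_CONTRACTS` setting, for example `SPIDER_CONTRACTS = {"myproject.contracts.RequiredFieldsContract": 10}` (the module path is hypothetical), then reference it by name in the docstring of the callback to check:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        """Parse the sample page.

        @url http://example.com
        @required_fields
        """
        # Your parsing logic here
        pass
```

With this setup, `scrapy check myspider` requests the `@url` page, runs `parse` on the response, and verifies that each returned item contains the fields required by `RequiredFieldsContract`. If an item is missing a field, the contract raises `ContractFail` and the check is reported as failed.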

#### `list`

Syntax: `scrapy list`

Requires project: yes

List all available spiders in the current project. The output is one spider per line.

Usage example:

```sh
$ scrapy list
spider1
spider2
```

#### `edit`

Syntax: `scrapy edit <spider>`

Requires project: yes

Edit the given spider using the editor defined in the `EDITOR` environment variable or (if unset) the `EDITOR` setting.

This command is provided only as a convenience shortcut for the most common case; the developer is of course free to choose any tool or IDE to write and debug spiders.

Usage example:
```sh
$ scrapy edit spider1
```

#### `parse`

Syntax: `scrapy parse <url> [options]`

Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the `--callback` option, or `parse` if not given.

Supported options:
- `--spider=SPIDER`: bypass spider autodetection and force use of specific spider
- `-a NAME=VALUE`: set spider argument (may be repeated)
- `--callback` or `-c`: spider method to use as callback for parsing the response
- `--meta` or `-m`: additional request meta that will be passed to the callback request. This must be a valid json string. Example: `--meta='{"foo" : "bar"}'`
- `--cbkwargs`: additional keyword arguments that will be passed to the callback. This must be a valid json string. Example: `--cbkwargs='{"foo" : "bar"}'`
- `--pipelines`: process items through pipelines
- `--rules` or `-r`: use `CrawlSpider` rules to discover the callback (i.e. spider method) to use for parsing the response
- `--noitems`: don’t show scraped items
- `--nolinks`: don’t show extracted links
- `--nocolour`: avoid using pygments to colorize the output
- `--depth` or `-d`: depth level for which the requests should be followed recursively (default: 1)
- `--verbose` or `-v`: display information for each depth level
- `--output` or `-o`: dump scraped items to a file (new in version 2.3)

Usage example:
```sh
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': 'Example item',
 'category': 'Furniture',
 'length': '12 cm'}]

# Requests  -----------------------------------------------------------------
[]
```

## Custom project commands

You can also add your custom project commands by using the `COMMANDS_MODULE` setting. See the Scrapy commands in [scrapy/commands](https://github.com/scrapy/scrapy/tree/master/scrapy/commands) for examples on how to implement your commands.

### `COMMANDS_MODULE`

Default: `''` (empty string)

A module to use for looking up custom Scrapy commands. This is used to add custom commands for your Scrapy project.

Example:
```python
COMMANDS_MODULE = "mybot.commands"
```

### Register commands via `setup.py` entry points

You can also add Scrapy commands from an external library by adding a `scrapy.commands` section in the entry points of the library `setup.py` file.

The following example adds `my_command` command:

```python
from setuptools import setup, find_packages

setup(
    name="scrapy-mymodule",
    entry_points={
        "scrapy.commands": [
            "my_command=my_scrapy_module.commands:MyCommand",
        ],
    },
)
```

# 2.2 Spiders

**Spiders** are classes which define how a certain site (or a group of sites) will be scraped, including 
- how to perform the crawl (i.e. follow links) and 
- how to extract structured data from their pages (i.e. scraping items). 

In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial **Requests** to crawl the first URLs, and specify a **callback function** to be called with the `response` downloaded from those requests.

    The first requests to perform are obtained by calling the `start_requests()` method, which by default generates a `Request` for each URL in `start_urls`, with the `parse` method as the callback.

2. In the callback function, you parse the **response** (web page) and return
    - [item objects](#2.4-Items),
    - `Request` objects, or
    - an iterable of these objects.

    Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

3. In callback functions, you parse the page contents, typically using [Selectors](#2.3-Selectors) (but you can also use `BeautifulSoup`, `lxml` or whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be
    - persisted to a database (in some [Item Pipeline](#2.7-Item-Pipeline)) or
    - written to a file using [Feed exports](#2.8-Feed-exports).

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.

## `scrapy.Spider`

```python
class scrapy.spiders.Spider
class scrapy.Spider
```

This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn’t provide any special functionality. It just provides a default `start_requests()` implementation which sends requests from the `start_urls` spider attribute and calls the spider’s method `parse` for each of the resulting responses.

### `name`

A string which defines the name for this spider. 

The spider `name` is how the spider is located (and instantiated) by Scrapy, so **it must be unique**. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls `mywebsite.com` would often be called `mywebsite`.

### `allowed_domains`

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if `OffsiteMiddleware` is enabled.

Let’s say your target URL is `https://www.example.com/1.html`; then add `'example.com'` to the list.

### `start_urls`

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. Subsequent `Request`s are generated successively from the data contained in the responses to the start URLs.

### `custom_settings`

A dictionary of settings that will be overridden from the project wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.

For a list of available built-in settings see: [Built-in settings reference](#Built-in-settings-reference).

### `crawler`

This attribute is set by the `from_crawler()` class method after initializing the class, and links to the `Crawler` object to which this spider instance is bound.

Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions, middlewares, signals managers, etc). See [Crawler API](#Crawler-API) to know more about them.

### `settings`

Configuration for running this spider. This is a `Settings` instance, see the [Settings](#2.11-Settings) topic for a detailed introduction on this subject.

### `logger`

Python logger created with the Spider’s name. You can use it to send log messages through it as described on [Logging from Spiders](#Logging-from-Spiders).

### `state`

A `dict` you can use to persist some spider state between batches. See [Keeping persistent state between batches](#Keeping-persistent-state-between-batches) to know more about it.

### `from_crawler(crawler, *args, **kwargs)`

This is the class method used by Scrapy to create your spiders.

You probably won’t need to override this directly because the default implementation acts as a proxy to the `__init__()` method, calling it with the given arguments `args` and named arguments `kwargs`.

Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider’s code.

_Changed in version 2.11:_ The settings in `crawler.settings` can now be modified in this method, which is handy if you want to modify them based on arguments. As a consequence, these settings aren’t the final values as they can be modified later by e.g. [add-ons](#5.2-Add-ons). For the same reason, most of the `Crawler` attributes aren’t initialized at this point.

The final settings and the initialized `Crawler` attributes are available in the `start_requests()` method, handlers of the `engine_started` signal and later.

**Parameters**
- `crawler` (Crawler instance) – crawler to which the spider will be bound
- `args` (list) – arguments passed to the `__init__()` method
- `kwargs` (dict) – keyword arguments passed to the `__init__()` method

### `classmethod update_settings(settings)`

The `update_settings()` method is used to modify the spider’s settings and is called during initialization of a spider instance.

It takes a `Settings` object as a parameter and can add or update the spider’s configuration values. This method is a class method, meaning that it is called on the Spider class and allows all instances of the spider to share the same configuration.

While per-spider settings can be set in `custom_settings`, using `update_settings()` allows you to dynamically add, remove or change settings based on other settings, spider attributes or other factors, and to use setting priorities other than `'spider'`. Also, it’s easy to extend `update_settings()` in a subclass by overriding it, while doing the same with `custom_settings` can be hard.

For example, suppose a spider needs to modify `FEEDS`:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    custom_feed = {
        "/home/user/documents/items.json": {
            "format": "json",
            "indent": 4,
        }
    }

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings.setdefault("FEEDS", {}).update(cls.custom_feed)
```

### `parse(response)`

This is the default callback used by Scrapy to process downloaded `response`s, when their requests don’t specify a callback.

The `parse` method is in charge of processing the `response` and returning scraped data and/or more URLs to follow. Other `Request` callbacks have the same requirements as this method.

This method, as well as any other `Request` callback, must return
- a `Request` object,
- an [item object](#2.4-Items),
- an iterable of `Request` objects and/or [item objects](#2.4-Items), or
- `None`.

**Parameters**
- `response` (Response) – the response to parse

### `log(message[, level, component])`

Wrapper that sends a log message through the Spider’s logger, kept for backward compatibility. For more information see [Logging from Spiders](#Logging-from-Spiders).

### `closed(reason)`

Called when the spider closes. This method provides a shortcut to `signals.connect()` for the `spider_closed` signal.

Let’s see an example:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/1.html",
        "http://www.example.com/2.html",
        "http://www.example.com/3.html",
    ]

    def parse(self, response):
        self.logger.info("A response from %s just arrived!", response.url)
```

Return multiple Requests and items from a single callback:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/1.html",
        "http://www.example.com/2.html",
        "http://www.example.com/3.html",
    ]

    def parse(self, response):
        for h3 in response.xpath("//h3").getall():
            yield {"title": h3}

        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
```

Instead of `start_urls` you can use `start_requests()` directly; to give data more structure you can use `Item` objects:

```python
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]

    def start_requests(self):
        yield scrapy.Request("http://www.example.com/1.html", self.parse)
        yield scrapy.Request("http://www.example.com/2.html", self.parse)
        yield scrapy.Request("http://www.example.com/3.html", self.parse)

    def parse(self, response):
        for h3 in response.xpath("//h3").getall():
            yield MyItem(title=h3)

        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
```

## Spider arguments

# 2.3 Selectors

## Selecting element attributes

# 2.4 Items

# 2.6 Scrapy shell

# 2.7 Item Pipeline

# 2.8 Feed exports

# 2.9 Requests and Responses

## Passing additional data to callback functions

# 2.11 Settings

## Designating the settings

## Built-in settings reference

# <b>3. Built-in Services</b>

# 3.1 Logging

## Logging from Spiders

# <b>4. Solving specific problems</b>

# 4.12 Benchmarking

# 4.13 Jobs: pausing and resuming crawls

## Keeping persistent state between batches

# <b>5. Extending Scrapy</b>

# 5.2 Add-ons

# 5.10 Core API

## Crawler API

# <b>(FORGET) Additional</b>

|bash|description|
|-|-|
|`scrapy startproject <name>`|start a new scrapy project|
|`scrapy genspider <spider_name> <domain>`|generate a spider in the `spiders` dir|
|`scrapy runspider <spider_file>.py`|run a self-contained spider file|

# 3. Creating a Scrapy project

You should work in a virtual environment.

```sh
pip install --upgrade pip
pip install scrapy
```

A **spider** is a class in a Scrapy project that, like its arachnid namesake, is designed to crawl webs. The project that will hold the spiders is created with `startproject`:

```sh
$ scrapy startproject test1
```
```
New Scrapy project 'test1', using template directory '/home/commi/venv/venv3.12/lib/python3.12/site-packages/scrapy/templates/project', created in:
    /home/commi/Yandex.Disk/it_learning/08_parsing_data/data/test1

You can start your first spider with:
    cd test1
    scrapy genspider example example.com
```

## Project dir

```sh
$ cd /home/commi/Yandex.Disk/it_learning/08_parsing_data/data/
$ tree test1
```
```
test1
├── scrapy.cfg
└── test1
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

3 directories, 7 files
```


```sh
$ cat test1/scrapy.cfg
```
```
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = test1.settings

[deploy]
#url = http://localhost:6800/
project = test1
```


### Deeper

```sh
$ tree test1/test1
```
```
test1/test1
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
    └── __init__.py

2 directories, 6 files
```


```sh
$ cat test1/test1/items.py
```
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Test1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
```


```sh
$ cat test1/test1/middlewares.py
```
```python
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class Test1SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spide
# [ ... output truncated ... ]
```

```sh
$ cat test1/test1/pipelines.py
```
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class Test1Pipeline:
    def process_item(self, item, spider):
        return item
```


```sh
$ cat test1/test1/settings.py
```
```python
# Scrapy settings for test1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "test1"

SPIDER_MODULES = ["test1.spiders"]
NEWSPIDER_MODULE = "test1.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "test1 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay settin
# [ ... output truncated ... ]
```

### Even deeper

```sh
$ tree test1/test1/spiders
```
```
test1/test1/spiders
└── __init__.py

1 directory, 1 file
```


```sh
$ cat test1/test1/spiders/__init__.py
```
```python
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
```


# 4. Write a Simple Scraper

To create a crawler, add a new file inside the spiders directory, at `test1/test1/spiders/bookspider.py`. The `genspider` command generates it from a template:

```sh
$ cd test1/test1/spiders/
$ scrapy genspider bookspider books.toscrape.com
```
```
Created spider 'bookspider' using template 'basic' in module:
  test1.spiders.bookspider
```

```sh
$ tree test1/test1/spiders/
```
```
test1/test1/spiders/
├── bookspider.py
├── __init__.py
└── __pycache__
    └── __init__.cpython-312.pyc

2 directories, 3 files
```


```sh
$ cat test1/test1/spiders/bookspider.py
```
```python
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass
```
