# SCRAPY [DOCUMENTATION](https://docs.scrapy.org/en/latest/index.html)

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

# Commands

## Scrapy command line

|Scrapy command line|Description|
|-|-|
|**Help**||
|`scrapy -h`|(`--help`) list all the available commands (run from the project's directory)|
|`scrapy <command> -h`|help on the given command|
|**List info**||
|`scrapy list`|list all available spiders (run from the project's directory)|
|**Open any webpage in the shell**||
|`scrapy shell <url>`|fetch the URL and start an interactive shell with the `response` object loaded|
|(inside shell) `view(response)`|open the response object in your browser|
|**Project**||
|`scrapy startproject project_name [project_dir]`|create a new project|
|`scrapy genspider [-t template] <name> <domain or URL>`|create a new spider in the current folder or in the current project’s spiders folder|
|`scrapy crawl spider_name`|run the spider (from the project's top level dir)|
|`scrapy crawl spider_name -s JOBDIR=crawls/spider_name-1`|pausing and resuming crawls|

```
scrapy --help
Scrapy 2.11.0 - active project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
```

## Scrapy extraction most common tools

|Scrapy extraction tools|Description|
|-|-|
|`view(response)`|open the response page from the shell in your web browser|
|**Response status codes**||
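|`response.status`|HTTP status code of the response, e.g. `200`|
|**CSS selectors**||
|`response.css("title")`|run a CSS query; returns a `SelectorList`|
|`response.css("title::text").getall()`|get only the text of every match, as a list of strings|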

# <b>1. Scrapy tutorial</b>

This tutorial will walk you through these tasks:

- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Exporting the scraped data using the command line
- Changing spider to recursively follow links
- Using spider arguments

# 1.0 Installation guide

We strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system packages.

```sh
(venv) $ pip install Scrapy
```

Scrapy is written in pure Python and depends on a few key Python packages (among others):

- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs

Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check [platform-specific guides](https://docs.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes).

In case of any trouble related to these dependencies, please refer to their respective installation instructions:

- [lxml installation](https://lxml.de/installation.html)
- [cryptography installation](https://cryptography.io/en/latest/installation/)

# 1.1 Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

```sh
scrapy startproject tutorial
```
```
New Scrapy project 'tutorial', using template directory '/home/commi/venv/venv3.11/lib/python3.11/site-packages/scrapy/templates/project', created in:
    /home/commi/Yandex.Disk/it_learning/08_web_scraping/02_scrapy/data/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
```

This will create a `tutorial` directory with the following contents:

In [2]:
cd ./data
tree tutorial

tutorial
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

3 directories, 7 files


```
tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py
```

# 1.2 Our first Spider

**Spiders** are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `Spider` and define the initial requests to make and, optionally, how to follow links in the pages and how to parse the downloaded page content to extract data.

This is the code for our first `Spider`. Save it in a file named `quotes_spider.py` under the `tutorial/spiders` directory in your project:

In [5]:
ls -R

.:
draft.py  __pycache__  quotes.jsonl  quotes_spider.py  tutorial

./__pycache__:
draft.cpython-311.pyc

./tutorial:
scrapy.cfg  tutorial

./tutorial/tutorial:
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders

./tutorial/tutorial/spiders:
__init__.py


```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
```

As you can see, our Spider subclasses `scrapy.Spider` and defines some attributes and methods:

- `name`: identifies the Spider. It **must be unique within a project**, that is, you can’t set the same name for different Spiders.

- `start_requests()`: must return an iterable of `Request`s (you can return a **list** of requests or write a **generator** function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The `response` parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the `response`, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.
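
To see where the file name in `parse()` comes from, the sketch below applies the same string operations to a plain URL, outside Scrapy. The helper name `filename_for` is an assumption made for illustration; the spider does this inline.

```python
def filename_for(url: str) -> str:
    """Mirror the naming logic used in parse()."""
    # "https://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number and
    # index -1 is the empty string after the trailing slash.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(filename_for("https://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(filename_for("https://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Note that this relies on the trailing slash in the URL: without it, index `-2` would pick up `page` instead of the number.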

_ChatGPT:_  
In Scrapy, when you use the `yield` statement within a Spider callback method like `parse`, the yielded items are not stored directly. Instead, they are processed by the Scrapy framework, typically passed to [Item Pipeline](#2.7-Item-Pipeline) components.

Here's what happens when you yield items in Scrapy:
- When you `yield` an item from a Spider callback method like `parse`, Scrapy will send that item to the Item Pipeline.
- The Item Pipeline is a mechanism for processing the items scraped by the Spider. It allows you to perform various tasks on the scraped items, such as cleaning, validation, and persistence.
- Each item that is `yield`ed is processed through the Item Pipeline sequentially, allowing you to define various stages of processing.
- The Item Pipeline can perform operations like validation and transformation on the items before they are saved to a storage backend such as a database, JSON file, or CSV file.
- You can define your own custom Item Pipeline to process the scraped items according to your requirements.

So, in summary, the yielded data from a Scrapy Spider is not stored directly within the Spider itself; instead, it's passed through the Item Pipeline for further processing and eventual storage.
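
As a rough illustration of the pipeline idea, here is a minimal sketch of a pipeline class. A Scrapy pipeline is a plain Python class with a `process_item(self, item, spider)` method; the class name `CleanQuotePipeline`, the `text` field, and the use of plain dicts as items are assumptions made for this example. In a real project such a class would live in `pipelines.py` and be switched on through the `ITEM_PIPELINES` setting.

```python
class CleanQuotePipeline:
    """Hypothetical pipeline: trims whitespace and curly quotes from 'text'."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() once for every item a spider yields.
        text = item.get("text")
        if text is not None:
            item["text"] = text.strip().strip("\u201c\u201d")
        # Returning the item hands it on to the next pipeline stage.
        return item


# Exercising the method by hand, outside Scrapy.
# The spider argument is unused in this sketch, so None is passed.
pipeline = CleanQuotePipeline()
raw_item = {
    "text": " \u201cA day without sunshine is like, you know, night.\u201d ",
    "author": "Steve Martin",
}
cleaned = pipeline.process_item(raw_item, spider=None)
print(cleaned["text"])
```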

A callback may return any iterable of items and requests, so returning a list also works. Using `yield` is the idiomatic choice, though, because it turns the callback into a generator whose output Scrapy consumes as it is produced. There are several reasons to prefer `yield` over building a list:

- **Memory Efficiency**: Using `yield` allows Scrapy to process items one at a time as they are generated, which is more memory-efficient compared to storing all items in a list. This is particularly beneficial when dealing with a large number of items or when scraping large websites.

- **Asynchronous Processing**: Scrapy is built on top of Twisted, an asynchronous networking framework, and that is what lets it handle multiple requests and responses concurrently. A generator callback fits this model: Scrapy can pick up each yielded item or request as soon as it is produced instead of waiting for the callback to finish.

- **Streaming Response**: When you `yield` items in Scrapy, they are immediately sent to downstream components such as pipelines or middleware for further processing. This enables a streaming-like behavior where items can be processed in real-time as they are generated, rather than waiting for the entire list to be created.

- **Pipeline Integration**: Item pipelines process items one at a time. With `yield`, each item flows into the pipelines as soon as it is produced, so data validation, cleaning, and storing need no manual list management.

- **Reduced Complexity**: Using `yield` simplifies the code and makes it more readable compared to managing lists of items. It follows the principle of generator-based programming, where you produce items as needed, resulting in cleaner and more maintainable code.

Overall, using yield in Scrapy spiders aligns with the framework's design principles and asynchronous nature, leading to better performance, scalability, and code clarity.
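
The memory argument can be seen in plain Python, without Scrapy. In the sketch below, `parse_as_list` and `parse_as_generator` are hypothetical stand-ins for a callback, and `islice` plays the role of a consumer that only needs the first few items.

```python
from itertools import islice


def parse_as_list(n):
    """Build every item first, then hand the whole list back."""
    return [{"id": i} for i in range(n)]


def parse_as_generator(n):
    """Yield items one at a time; nothing is built until the consumer asks."""
    for i in range(n):
        yield {"id": i}


# The consumer takes only the first three items, so the generator
# never creates the remaining 999,997 dictionaries.
first_three = list(islice(parse_as_generator(1_000_000), 3))
print(first_three)

# The list version yields the same items, but only after creating all of them.
same_items = parse_as_list(5)[:3]
print(same_items == first_three)
```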

## How to run our spider

To put our spider to work, go to the project’s top level directory and run:

```sh
scrapy crawl quotes
```
This command runs the spider named `quotes` that we’ve just added, which sends some requests to the `quotes.toscrape.com` domain. You will get output similar to this:

```
2024-02-05 01:57:09 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)
2024-02-05 01:57:09 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
2024-02-05 01:57:09 [scrapy.addons] INFO: Enabled addons:
[]
2024-02-05 01:57:09 [asyncio] DEBUG: Using selector: EpollSelector
2024-02-05 01:57:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-05 01:57:09 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-02-05 01:57:09 [scrapy.extensions.telnet] INFO: Telnet Password: b9a7a03404bf04d7
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-02-05 01:57:09 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-05 01:57:09 [scrapy.core.engine] INFO: Spider opened
2024-02-05 01:57:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-02-05 01:57:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-05 01:57:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2024-02-05 01:57:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2024-02-05 01:57:10 [quotes] DEBUG: Saved file quotes-1.html
2024-02-05 01:57:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2024-02-05 01:57:10 [quotes] DEBUG: Saved file quotes-2.html
2024-02-05 01:57:10 [scrapy.core.engine] INFO: Closing spider (finished)
2024-02-05 01:57:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 684,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 25556,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.235283,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 2, 4, 20, 57, 10, 482312, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 8,
 'log_count/INFO': 10,
 'memusage/max': 65585152,
 'memusage/startup': 65585152,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 2, 4, 20, 57, 9, 247029, tzinfo=datetime.timezone.utc)}
2024-02-05 01:57:10 [scrapy.core.engine] INFO: Spider closed (finished)
```

Now, check the files in the current directory. You should notice that two new files have been created, `quotes-1.html` and `quotes-2.html`, with the content for the respective URLs, as our `parse` method instructs:

In [10]:
ls ./tutorial

quotes-1.html  quotes-2.html  scrapy.cfg  tutorial


In [12]:
cat ./tutorial/quotes-1.html

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

## What has just happened under the hood?

Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the Spider. Upon receiving a `response` for each one, it instantiates `Response` objects and calls the callback method associated with the `request` (in this case, the `parse` method) passing the response as argument.
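
The flow described above can be imitated with a toy loop. This is only a mental model, not Scrapy's implementation: `ToyRequest`, `ToyResponse`, and `FAKE_PAGES` are made-up names, the "download" is a dictionary lookup, and everything runs sequentially, whereas Scrapy downloads concurrently and does not guarantee the order of responses.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Stand-in for the network: maps a URL to the body a download would return.
FAKE_PAGES = {
    "https://quotes.toscrape.com/page/1/": "<html>page one</html>",
    "https://quotes.toscrape.com/page/2/": "<html>page two</html>",
}


@dataclass
class ToyResponse:
    url: str
    body: str


@dataclass
class ToyRequest:
    url: str
    callback: Callable[[ToyResponse], None]


seen = []


def parse(response: ToyResponse) -> None:
    # The callback receives the response object built for its request.
    seen.append((response.url, len(response.body)))


def start_requests():
    for url in FAKE_PAGES:
        yield ToyRequest(url=url, callback=parse)


# 1. Schedule the initial requests.
queue = deque(start_requests())

# 2. "Download" each one, build a response, and call the request's callback.
while queue:
    request = queue.popleft()
    response = ToyResponse(url=request.url, body=FAKE_PAGES[request.url])
    request.callback(response)

print(seen)
```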

## A shortcut to the `start_requests` method

Instead of implementing a `start_requests()` method that generates `scrapy.Request` objects from URLs, you can just define a `start_urls` class attribute with a list of URLs. This list will then be used by the default implementation of `start_requests()` to create the initial requests for your spider.

```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
```

The `parse()` method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because `parse()` is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
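
The two defaults at work here (initial requests built from `start_urls`, and `parse()` used when a request has no callback) can be sketched in plain Python. `ToySpider`, `QuotesToySpider`, and `resolve_callback` are hypothetical names, and requests are modelled as dictionaries; this imitates the behaviour described above rather than reproducing Scrapy's code.

```python
class ToySpider:
    """Hypothetical base class imitating two Scrapy defaults."""

    start_urls = []

    def start_requests(self):
        # Default: one request per entry in start_urls, no explicit callback.
        for url in self.start_urls:
            yield {"url": url, "callback": None}

    def parse(self, response):
        raise NotImplementedError


class QuotesToySpider(ToySpider):
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        return f"parsed {response}"


def resolve_callback(spider, request):
    # A request without a callback falls back to the spider's parse() method.
    return request["callback"] or spider.parse


spider = QuotesToySpider()
requests = list(spider.start_requests())
# The URL string stands in for the downloaded response in this sketch.
results = [resolve_callback(spider, r)(r["url"]) for r in requests]
print(results)
```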

## Extracting data

The best way to learn how to extract data with Scrapy is to try **selectors** in the Scrapy shell. Run:

```sh
scrapy shell 'https://quotes.toscrape.com/page/1/'
```

> Note: Always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the `&` character) will not work.
>
> On Windows, use double quotes instead:
>
> `scrapy shell "https://quotes.toscrape.com/page/1/"`

You will see something like:

```
2024-02-06 14:47:14 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)
2024-02-06 14:47:14 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
2024-02-06 14:47:14 [scrapy.addons] INFO: Enabled addons:
[]
2024-02-06 14:47:14 [asyncio] DEBUG: Using selector: EpollSelector
2024-02-06 14:47:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-06 14:47:14 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-02-06 14:47:14 [scrapy.extensions.telnet] INFO: Telnet Password: 53d4e3939b5fb7e7
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2024-02-06 14:47:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-06 14:47:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-06 14:47:14 [scrapy.core.engine] INFO: Spider opened
2024-02-06 14:47:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2024-02-06 14:47:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f5b3ece0cd0>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7f5b3ffddc10>
[s]   spider     <DefaultSpider 'default' at 0x7f5b3e7fa950>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2024-02-06 14:47:16 [asyncio] DEBUG: Using selector: EpollSelector
```
```ipython
In [1]: 
```

Using the shell, you can try selecting elements using [CSS](https://www.w3.org/TR/selectors) with the `response` object:

```ipython
In [1]: response.css("title")
Out[1]: [<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

In [2]: response.status
Out[2]: 200
```

### `getall()` and `get()`

The result of running `response.css('title')` is a list-like object called **SelectorList**, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

```ipython
In [9]: response.css("title::text").getall()
Out[9]: ['Quotes to Scrape']
```

There are two things to note here: one is that we’ve added `::text` to the CSS query, to mean we want to select only the text elements directly inside the `<title>` element. If we don’t specify `::text`, we’d get the full title element, including its tags:

```ipython
In [11]: response.css("title").getall()
Out[11]: ['<title>Quotes to Scrape</title>']
```

The other thing is that the result of calling `.getall()` is a _list_: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

```ipython
In [12]: response.css("title::text").get()
Out[12]: 'Quotes to Scrape'
```

As an alternative, you could’ve written:

```ipython
In [16]: response.css("title::text")[0].get()
Out[16]: 'Quotes to Scrape'
```

Accessing an index on a SelectorList instance will raise an `IndexError` exception if there are no results. You might want to use `.get()` directly on the SelectorList instance instead, which returns `None` if there are no results:

```ipython
In [17]: response.css("noelement").get()
In [18]: response.css("noelement")[0].get()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
...
IndexError: list index out of range
```

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
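
The difference between the two access styles can be mimicked with plain lists standing in for query results. This is an analogy, not Scrapy code: the `first` helper is hypothetical. In Scrapy itself, `.get()` also accepts a `default` argument, as in `response.css("noelement").get(default="not-found")`.

```python
def first(results, default=None):
    """Return the first element of a list-like object, or a default if empty."""
    return results[0] if results else default


matches = ["Quotes to Scrape"]  # stands in for a query that found something
no_matches = []                 # stands in for a query that found nothing

print(first(matches))
print(first(no_matches))
print(first(no_matches, default="not-found"))

# Indexing the empty result directly fails instead of degrading gracefully.
try:
    no_matches[0]
    index_error_raised = False
except IndexError as exc:
    index_error_raised = True
    print(f"IndexError: {exc}")
```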

### `re()`

Besides the `getall()` and `get()` methods, you can also use the `re()` method to extract using [regular expressions](https://docs.python.org/3/library/re.html):

```ipython
In [20]: response.css("title::text").re(r".*uot.*")
Out[20]: ['Quotes to Scrape']

In [21]: response.css("title::text").re(r"Q\w+")
Out[21]: ['Quotes']

In [22]: response.css("title::text").re(r"(\w+) to (\w+)")
Out[22]: ['Quotes', 'Scrape']
```

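- `\w` matches any word character: letters, digits, and the underscore. For `str` patterns in Python 3 this includes Unicode letters and digits; it equals `[a-zA-Z0-9_]` only when the `re.ASCII` flag is set.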
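
The patterns from the shell session can be tried with Python's standard `re` module alone. The sketch below is not how the selector's `re()` method is implemented; it only shows what each pattern matches, and how a flat list like the one in the shell output relates to the tuples that `re.findall()` returns for patterns with groups.

```python
import re

title = "Quotes to Scrape"

# Same patterns as in the shell session, applied with the standard library.
print(re.findall(r"Q\w+", title))            # ['Quotes']
print(re.findall(r"(\w+) to (\w+)", title))  # [('Quotes', 'Scrape')]

# With groups, re.findall() returns tuples; flattening them gives
# the shape seen in the shell output.
flat = [group for match in re.findall(r"(\w+) to (\w+)", title) for group in match]
print(flat)                                  # ['Quotes', 'Scrape']

# For str patterns, \w is Unicode-aware unless re.ASCII is set.
name = "Andr\u00e9"  # "André"
unicode_match = re.findall(r"\w+", name)
ascii_match = re.findall(r"\w+", name, flags=re.ASCII)
print(unicode_match, ascii_match)
```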

### `view(response)`

In order to find the proper CSS selectors to use, you might find it useful to open the `response` page from the shell in your web browser using `view(response)`. You can use your browser’s developer tools to inspect the HTML and come up with a selector (see [Using your browser’s Developer Tools for scraping](https://docs.scrapy.org/en/latest/topics/developer-tools.html#topics-developer-tools)).

[Selector Gadget](https://selectorgadget.com/) is also a nice tool to quickly find CSS selector for visually selected elements, which works in many browsers.

### `XPath`: a brief intro

See [XPath](../XPath_tutorial.ipynb#XPath).

Besides CSS, Scrapy selectors also support using XPath expressions:

```ipython
In [24]: response.xpath("//title")
Out[24]: [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]

In [25]: response.xpath("//title/text()").get()
Out[25]: 'Quotes to Scrape'
```

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. [You can see that](#Extracting-data) if you closely read the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like: _select the link that contains the text “Next Page”_. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
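
Selecting by content can be tried without Scrapy, using the limited XPath subset in the standard library's `xml.etree.ElementTree`. The markup below is made up for this example, and ElementTree only supports exact text matches; in the Scrapy shell, `response.xpath()` accepts full XPath 1.0, so an expression such as `//a[contains(text(), "Next Page")]/@href` would be the closer equivalent.

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment; the markup is invented for this example.
fragment = """
<ul class="pager">
    <li><a href="/page/1/">Previous Page</a></li>
    <li><a href="/page/3/">Next Page</a></li>
</ul>
"""
root = ET.fromstring(fragment)

# Structure only: every <a> element below the root.
all_links = [a.get("href") for a in root.findall(".//a")]

# Structure plus content: only the <a> whose text is exactly "Next Page".
next_links = [a.get("href") for a in root.findall(".//a[.='Next Page']")]

print(all_links)
print(next_links)
```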

We won’t cover much of XPath here, but you can read more about using [XPath with Scrapy Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors). To learn more about XPath, we recommend this [tutorial to learn XPath through examples](http://zvon.org/comp/r/tut-XPath_1.html), and this tutorial to learn [“how to think in XPath”](http://plasmasturm.org/log/xpath101/).

### Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in `https://quotes.toscrape.com` is represented by HTML elements that look like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```

Let’s open up `scrapy shell` and play a bit to find out how to extract the data we want. We get a list of selectors for the quote HTML elements with:

```sh
scrapy shell 'https://quotes.toscrape.com'
```
```ipython
In [1]: response.css("div.quote")
Out[1]: 
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]
```

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

```ipython
In [3]: quote = response.css("div.quote")[0]
```

Now, let’s extract `text`, `author` and the `tags` from that quote using the `quote` object we just created:

```ipython
In [4]: text = quote.css("span.text::text").get()
In [5]: text
Out[5]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
```

Given that the `tags` are a list of strings, we can use the `.getall()` method to get all of them:

```ipython
In [6]: tags = quote.css("div.tags a.tag::text").getall()
In [7]: tags
Out[7]: ['change', 'deep-thoughts', 'thinking', 'world']
```

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

```ipython
In [8]: for quote in response.css("div.quote"):
   ...:     text = quote.css("span.text::text").get()
   ...:     author = quote.css("small.author::text").get()
   ...:     tags = quote.css("div.tags a.tag::text").getall()
   ...:     print(dict(text=text, author=author, tags=tags))
   ...: 
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
```

### Extracting data in our spider

Let’s get back to our spider. So far it doesn’t extract any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the `yield` Python keyword in the `callback`, as you can see below:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
```

To run this spider, exit the `scrapy shell` and run the crawler:

```sh
# type quit() inside the Scrapy shell first, then run this from the project directory:
scrapy crawl quotes
```
```
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2024-02-06 23:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2024-02-06 23:53:53 [scrapy.core.engine] INFO: Closing spider (finished)
```

## Storing the scraped data

The simplest way to store the scraped data is by using [Feed exports](#2.8-Feed-exports), with the following command:

```sh
scrapy crawl quotes -O quotes.json
```

That will generate a `quotes.json` file containing all scraped items, serialized in JSON.

The `-O` command-line switch overwrites any existing file; use `-o` instead to append new content to any existing file. However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as `JSON Lines`:

```sh
scrapy crawl quotes -o quotes.jsonl
```

The [JSON Lines format](http://jsonlines.org/) is useful because it’s stream-like: you can easily append new records to it, so it doesn’t have the same problem as JSON when you run the command twice. Also, as each record is on a separate line, you can process big files without having to fit everything in memory; tools like `jq` help do that at the command line.
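To make the difference concrete, here is a small sketch using only the standard library. The two "runs" are hand-written stand-ins for feed output (not real crawl results), and simple string concatenation is used as an approximation of appending to an existing file.

```python
import json

# Hand-written stand-ins for the output of two separate runs (assumed data).
first_run = json.dumps([{"author": "Jane Austen"}])
second_run = json.dumps([{"author": "Mark Twain"}])

# Appending one JSON array after another does not produce a valid JSON document.
appended_json = first_run + second_run
try:
    json.loads(appended_json)
    appended_is_valid = True
except json.JSONDecodeError:
    appended_is_valid = False

# With JSON Lines, every line is an independent JSON document,
# so appended records can still be read one line at a time.
jsonl = '{"author": "Jane Austen"}\n{"author": "Mark Twain"}\n'
authors = [json.loads(line)["author"] for line in jsonl.splitlines() if line.strip()]

print(appended_is_valid)
print(authors)
```

For a large file on disk, the same loop can iterate over the open file object instead of `splitlines()`, so only one record is held in memory at a time.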

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an [Item Pipeline](#2.7-Item-Pipeline). A placeholder file for Item Pipelines has been set up for you when the project is created, in `tutorial/pipelines.py`. Though you don’t need to implement any item pipelines if you just want to store the scraped items.
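As a rough illustration of what such a pipeline could look like, here is a minimal sketch. The class name `StripQuoteMarksPipeline` and its clean-up rule are assumptions made up for this example; only the `process_item(self, item, spider)` method shape follows Scrapy’s item pipeline interface.

```python
class StripQuoteMarksPipeline:
    """Hypothetical pipeline: removes the surrounding curly quote marks from 'text'."""

    def process_item(self, item, spider):
        # Scrapy calls this method once per scraped item; the returned item
        # is handed on to the next pipeline (or to the feed export).
        item["text"] = item["text"].strip("“”")
        return item


# Stand-alone check outside of Scrapy; the spider argument is unused here.
cleaned = StripQuoteMarksPipeline().process_item(
    {"text": "“Try not to become a man of success.”", "author": "Albert Einstein"},
    spider=None,
)
print(cleaned["text"])
```

Inside a project, a class like this would live in `tutorial/pipelines.py` and only takes effect once it is listed in the `ITEM_PIPELINES` setting.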

```sh
$ scrapy crawl -h
```
```
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
  -h, --help            show this help message and exit
  -a NAME=VALUE         set spider argument (may be repeated)
  -o FILE, --output FILE
                        append scraped items to the end of FILE (use - for stdout), to define format set a colon at the end of the output
                        URI (i.e. -o FILE:FORMAT)
  -O FILE, --overwrite-output FILE
                        dump scraped items into FILE, overwriting any existing file, to define format set a colon at the end of the
                        output URI (i.e. -O FILE:FORMAT)
  -t FORMAT, --output-format FORMAT
                        format to use for dumping items

Global Options
--------------
  --logfile FILE        log file. if omitted stderr will be used
  -L LEVEL, --loglevel LEVEL
                        log level (default: DEBUG)
  --nolog               disable logging completely
  --profile FILE        write python cProfile stats to FILE
  --pidfile FILE        write process ID to FILE
  -s NAME=VALUE, --set NAME=VALUE
                        set/override setting (may be repeated)
  --pdb                 enable pdb on failure
```

## Following links

Let’s say, instead of just scraping the stuff from the first two pages from https://quotes.toscrape.com, you want quotes from all the pages in the website.

Now that you know how to extract data from pages, let’s see how to follow links from them.

First thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

```html
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>
```

We can try extracting it in the shell:

```ipython
In [1]: response.css("li.next a")
Out[1]: [<Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' next ')]/descendant-or-self::*/a" data='<a href="/page/2/">Next <span aria-hi...'>]

In [2]: response.css("li.next a").get()
Out[2]: '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
```

This gets the anchor element, but we want the attribute `href`. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

```ipython
In [3]: response.css("li.next a::attr(href)").get()
Out[3]: '/page/2/'
```
There is also an `attrib` property available (see [Selecting element attributes](#Selecting-element-attributes) for more):

```ipython
In [4]: response.css("li.next a").attrib["href"]
Out[4]: '/page/2/'
```

Here is our spider, modified to recursively follow the link to the next page and extract data from it:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

Now, after extracting the data, the `parse()` method looks for the link to the next page, builds a full absolute URL using the `urljoin()` method (since the links can be relative) and `yield`s a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
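To see what resolving a relative link means, the sketch below uses `urllib.parse.urljoin` from the standard library as a stand-in for `response.urljoin()`, which resolves the link against the response’s own URL. The example hrefs are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# URL of the page the link was found on.
page_url = "https://quotes.toscrape.com/page/1/"

# A root-relative href, a path-relative href and an absolute href
# all resolve to the same absolute URL.
from_root = urljoin(page_url, "/page/2/")
from_parent = urljoin(page_url, "../2/")
already_absolute = urljoin(page_url, "https://quotes.toscrape.com/page/2/")

print(from_root)
print(from_parent)
print(already_absolute)
```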

What you see here is Scrapy’s mechanism of following links: when you `yield` a `Request` in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with **pagination**.
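The scheduling idea can be imitated in a few lines of plain Python. This is only a toy model, not how Scrapy is implemented: `FAKE_SITE`, `ToyRequest` and `crawl` are hypothetical names, the "site" is an in-memory dictionary, and requests are processed one after another instead of asynchronously.

```python
from collections import deque
from typing import Callable, NamedTuple, Optional

# Hypothetical in-memory site: URL -> (quotes on that page, URL of the next page).
FAKE_SITE = {
    "https://example.test/page/1/": (["q1", "q2"], "https://example.test/page/2/"),
    "https://example.test/page/2/": (["q3"], "https://example.test/page/3/"),
    "https://example.test/page/3/": (["q4"], None),
}


class ToyRequest(NamedTuple):
    """Stand-in for a request: a URL plus the callback that handles its response."""
    url: str
    callback: Callable


def parse(url):
    quotes, next_url = FAKE_SITE[url]
    for text in quotes:
        yield {"text": text}               # scraped item
    if next_url is not None:
        yield ToyRequest(next_url, parse)  # follow the "next" link


def crawl(start_url):
    queue = deque([ToyRequest(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, ToyRequest):
                queue.append(result)       # schedule the new request
            else:
                items.append(result)       # collect the item
    return items


items = crawl("https://example.test/page/1/")
print([item["text"] for item in items])
```

The loop ends on its own once a page yields no further request, which mirrors what happens when the spider reaches the last page.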

### A shortcut for creating Requests

As a shortcut for creating `Request` objects you can use `response.follow`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Unlike `scrapy.Request`, `response.follow` supports relative URLs directly - no need to call `urljoin`. Note that `response.follow` just returns a `Request` instance; you still have to `yield` this `Request`.

You can also pass a selector to `response.follow` instead of a string; this selector should extract necessary attributes:

```python
for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)
```

For `<a>` elements there is a shortcut: `response.follow` uses their `href` attribute automatically. So the code can be shortened further:

```python
for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)
```

To create multiple requests from an iterable, you can use `response.follow_all` instead:

```python
anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)
```

or, shortening it further:

```python
yield from response.follow_all(css="ul.pager a", callback=self.parse)
```

## More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }
```

This spider will start from the main page. It will follow all the links to the author pages, calling the `parse_author` callback for each of them, and also the pagination links with the `parse` callback, as we saw before.

Here we’re passing callbacks to `response.follow_all` as positional arguments to make the code shorter; it also works for `Request`.

The `parse_author` callback defines a helper function to extract and clean up the data from a CSS query and `yield`s the Python `dict` with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting `DUPEFILTER_CLASS`.
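The effect of duplicate filtering can be pictured with a toy filter. This is a simplified sketch, not Scrapy’s implementation: the class `SeenUrlFilter` is hypothetical and compares plain URL strings, whereas Scrapy’s default filter works on request fingerprints. The author URLs below are made-up sample values.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers every URL it has been asked about."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        # Returns True if the URL was seen before, otherwise records it.
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


# Several quotes can link to the same author page (assumed sample links).
author_links = [
    "https://example.test/author/Albert-Einstein",
    "https://example.test/author/Jane-Austen",
    "https://example.test/author/Albert-Einstein",
]

dupefilter = SeenUrlFilter()
scheduled = [url for url in author_links if not dupefilter.request_seen(url)]
print(scheduled)
```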

Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the `CrawlSpider` class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.

Also, a common pattern is to build an item with data from more than one page, using a trick to [pass additional data to the callbacks](#Passing-additional-data-to-callback-functions).

# 1.3 Using spider arguments

You can provide command line arguments to your spiders by using the `-a` option when running them:

```sh
scrapy crawl quotes -O quotes-humor.json -a tag=humor
```

These arguments are passed to the Spider’s `__init__` method and become spider attributes by default.

In this example, the value provided for the `tag` argument will be available via `self.tag`. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url += "tag/" + tag

        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

If you pass the `tag=humor` argument to this spider, you’ll notice that it will only visit URLs from the humor tag, such as `https://quotes.toscrape.com/tag/humor`.
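The way an `-a` argument ends up as an attribute can be sketched without Scrapy. `ArgSpider` below is a hypothetical stand-in for a spider class; it only assumes that keyword arguments passed to `__init__` are stored on the instance, as described above.

```python
class ArgSpider:
    """Hypothetical stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        # Each NAME=VALUE pair is stored as an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no tag argument was given
        if tag is not None:
            url += "tag/" + tag
        return url


without_tag = ArgSpider().start_url()
with_tag = ArgSpider(tag="humor").start_url()
print(without_tag)
print(with_tag)
```

Note that values given on the command line arrive as strings, so a numeric argument would need converting before use.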

You can learn more about [handling spider arguments here](#Spider-arguments).

# <b>2. Basic concepts</b>

# Spider vs Crawler

_ChatGPT_:  
In Scrapy, "Crawler" and "Spider" are closely related but distinct concepts.

- **Spider**: A spider is a class in Scrapy that defines how a particular website (or a group of websites) will be scraped. It contains the logic to extract data from web pages, including how to follow links within those pages. Each spider is responsible for a specific website or set of websites and defines the scraping behavior, such as what pages to scrape, how to extract data, and how to follow links.

- **Crawler**: The term "crawler" is often used interchangeably with "spider" in the context of web scraping. However, in Scrapy, the term "crawler" refers to the entire scraping process, which includes the spiders as well as other components such as the request scheduler, the item pipeline, and settings. The crawler manages the execution of spiders, handles requests and responses, and orchestrates the entire scraping process.

In summary, a spider is a specific component within a Scrapy project responsible for defining how data will be extracted from a website, while a crawler is the broader term encompassing the entire scraping process, including the spiders, request management, and other related functionalities.

# 2.1 Command line tool

Scrapy is controlled through the `scrapy` command-line tool, referred to here as the **“Scrapy tool”** to differentiate it from the sub-commands, which we just call **“commands”** or **“Scrapy commands”**.

The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.

(The `scrapy deploy` command was removed in 1.0 in favor of the standalone `scrapyd-deploy`. See [Deploying your project](https://scrapyd.readthedocs.io/en/latest/deploy.html).)

## Configuration settings

Scrapy will look for configuration parameters in ini-style `scrapy.cfg` files in standard locations:
- `/etc/scrapy.cfg` or `c:\scrapy\scrapy.cfg` (system-wide),
- `~/.config/scrapy.cfg` (`$XDG_CONFIG_HOME`) and `~/.scrapy.cfg` (`$HOME`) for global (user-wide) settings, and
- `scrapy.cfg` inside a Scrapy project’s root (see next section).

Settings from these files are merged in the listed order of preference: 
- user-defined values have higher priority than system-wide defaults and 
- project-wide settings will override all others, when defined.
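The override order can be reproduced with the standard-library `configparser`. This is a sketch of the merging idea only; the three configuration strings and their values are assumptions for the example, not the contents of real files.

```python
from configparser import ConfigParser

# Assumed contents of the three locations, from lowest to highest priority.
system_cfg = "[settings]\ndefault = systemwide.settings\n"
user_cfg = "[settings]\ndefault = userwide.settings\n\n[deploy]\nurl = http://localhost:6800/\n"
project_cfg = "[settings]\ndefault = myproject.settings\n"

parser = ConfigParser()
# Sources read later override values read earlier; other keys are kept.
for text in (system_cfg, user_cfg, project_cfg):
    parser.read_string(text)

settings_module = parser.get("settings", "default")
deploy_url = parser.get("deploy", "url")
print(settings_module)
print(deploy_url)
```

The project-level value wins for `default`, while the `[deploy]` entry that only exists in the user-wide source survives the merge.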

Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:
- `SCRAPY_SETTINGS_MODULE` (see [Designating the settings](#Designating-the-settings))
- `SCRAPY_PROJECT` (see [Sharing the root directory between projects](#Sharing-the-root-directory-between-projects))
- `SCRAPY_PYTHON_SHELL` (see [Scrapy shell](#2.6-Scrapy-shell))

_ChatGPT:_  
Ini-style, short for "Initialization style," refers to a simple text-based file format used for configuration or initialization files in computing. It is named after the ".ini" file extension commonly associated with these types of files in Windows environments.

Ini-style files consist of sections, each containing key-value pairs, typically used to represent configuration settings for applications or systems. The structure is straightforward, with sections enclosed in square brackets "`[ ]`" and key-value pairs separated by an equals sign "`=`" or a colon "`:`".

### Default structure of Scrapy projects

Before delving into the command-line tool and its sub-commands, let’s first understand the directory structure of a Scrapy project.

Though it can be modified, all Scrapy projects have the same file structure by default, similar to this:

```
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
```

The directory where the `scrapy.cfg` file resides is known as the **project root directory**. That file contains the name of the Python module that defines the project settings. Here is an example:

```ini
[settings]
default = myproject.settings
```

In [5]:
tree data/tutorial

data/tutorial
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── __pycache__
    │   ├── __init__.cpython-311.pyc
    │   └── settings.cpython-311.pyc
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-311.pyc
        │   ├── quotes_spider.cpython-311.pyc
        │   └── tmp.cpython-311.pyc
        ├── quotes_spider.py
        └── tmp.py

5 directories, 14 files


### Sharing the root directory between projects

A project root directory, the one that contains the `scrapy.cfg`, may be shared by multiple Scrapy projects, each with its own settings module.

In that case, you must define one or more aliases for those settings modules under `[settings]` in your `scrapy.cfg` file:

```
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
```

In [6]:
cat data/tutorial/scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial


By default, the `scrapy` command-line tool will use the `default` settings. Use the `SCRAPY_PROJECT` environment variable to specify a different project for `scrapy` to use:

```sh
$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
```
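The alias lookup can be pictured as follows. This is a hedged sketch of the idea rather than Scrapy’s own code: `resolve_settings_module` is a hypothetical helper, and the environment is passed in as a plain dictionary so the example does not depend on the real process environment.

```python
from configparser import ConfigParser

CFG_TEXT = """
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
"""


def resolve_settings_module(cfg_text, environ):
    """Hypothetical helper: pick the settings module for the active project alias."""
    parser = ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the "default" alias when SCRAPY_PROJECT is not set.
    alias = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", alias)


no_env = resolve_settings_module(CFG_TEXT, {})
with_env = resolve_settings_module(CFG_TEXT, {"SCRAPY_PROJECT": "project2"})
print(no_env)
print(with_env)
```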

## Using the `scrapy` tool

### General info and help

You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands:

```sh
$ scrapy
Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]
```

The first line will print the currently active project if you’re inside a Scrapy project. In this example it was run from outside a project. If run from inside a project it would have printed something like this:

```sh
Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]

[...]
```

### Creating projects

The first thing you typically do with the `scrapy` tool is create your Scrapy project:

```sh
scrapy startproject myproject [project_dir]
```

That will create a Scrapy project under the `project_dir` directory. If `project_dir` wasn’t specified, `project_dir` will be the same as `myproject`.

Next, you go inside the new project directory:

```sh
cd project_dir
```

And you’re ready to use the `scrapy` command to manage and control your project from there.

### Controlling projects

You use the `scrapy` tool _from inside_ your projects to control and manage them.

For example, to create a new spider:

```
scrapy genspider mydomain mydomain.com
```

Some Scrapy commands (like `crawl`) must be run from inside a Scrapy project. See the [commands reference](#Available-tool-commands) below for more information on which commands must be run from inside projects, and which not.

Also keep in mind that some commands may have slightly different behaviours when running them from inside projects. For example, the `fetch` command will use spider-overridden behaviours (such as the `user_agent` attribute to override the user-agent) if the url being fetched is associated with some specific spider. This is intentional, as the `fetch` command is meant to be used to check how spiders are downloading pages.

## Available tool commands

This section contains a list of the available built-in commands with a description and some usage examples. Remember, you can always get more info about each command by running:

```sh
scrapy <command> -h
```

And you can see all available commands with:

```sh
scrapy -h
```

There are two kinds of commands, 
- those that only work from inside a `Scrapy` project (**Project-specific commands**) and 
- those that also work without an active Scrapy project (**Global commands**), 

though they may behave slightly differently when run from inside a project (as they would use the project’s overridden settings).

### Global commands

- `startproject`
- `genspider`
- `settings`
- `runspider`
- `shell`
- `fetch`
- `view`
- `version`

#### `startproject`

Syntax: `scrapy startproject <project_name> [project_dir]`

Requires project: no

Creates a new Scrapy project named `project_name`, under the `project_dir` directory. If `project_dir` wasn’t specified, `project_dir` will be the same as `project_name`.

Usage example:
```sh
$ scrapy startproject myproject
```

#### `genspider`

Syntax: `scrapy genspider [-t template] <name> <domain or URL>`

Requires project: no

New in version 2.6.0: The ability to pass a URL instead of a domain.

Create a new spider in the current folder or in the current project’s spiders folder, if called from inside a project. The `<name>` parameter is set as the spider’s name, while `<domain or URL>` is used to generate the `allowed_domains` and `start_urls` spider’s attributes.

Usage example:

```sh
$ scrapy genspider -l
```
```
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```
```sh
$ scrapy genspider example example.com
```
```
Created spider 'example' using template 'basic'
```
```sh
$ scrapy genspider -t crawl scrapyorg scrapy.org
```
```
Created spider 'scrapyorg' using template 'crawl'
```

This is just a convenience shortcut command for creating spiders based on pre-defined templates, but certainly not the only way to create spiders. 

> You can just create the spider source code files yourself, instead of using this command.

#### `fetch`

Syntax: `scrapy fetch <url>`

Requires project: no

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a `user_agent` attribute which overrides the User Agent, it will use that one.

So this command can be used to “see” how your spider would fetch a certain page.

If used outside a project, no particular per-spider behaviour would be applied and it will just use the default Scrapy downloader settings.

Supported options:
- `--spider=SPIDER`: bypass spider autodetection and force use of specific spider
- `--headers`: print the response’s HTTP headers instead of the response’s body
- `--no-redirect`: do not follow HTTP 3xx redirects (default is to follow them)

Usage examples:

```sh
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
```

#### `shell`

Syntax: `scrapy shell [url]`

Requires project: no

Starts the Scrapy shell for the given URL (if given) or empty if no URL is given. Also supports UNIX-style local file paths, either relative with `./` or `../` prefixes or absolute file paths. See [Scrapy shell](#2.6-Scrapy-shell) for more info.

Supported options:
- `--spider=SPIDER`: bypass spider autodetection and force use of specific spider
- `-c code`: evaluate the code in the shell, print the result and exit
- `--no-redirect`: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, `fetch(url)` will still follow HTTP redirects by default.

Usage example:

```sh
$ scrapy shell http://www.example.com/some/page.html
```
```
[ ... scrapy shell starts ... ]
```
```sh
$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
```
```
(200, 'http://www.example.com/')
```
```sh
# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
```
```
(200, 'http://example.com/')
```
```sh
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
```
```
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
```

#### `view`

Syntax: `scrapy view <url>`

Requires project: no

Opens the given URL in a browser, as your Scrapy spider would “see” it. Sometimes spiders see pages differently from regular users, so this can be used to check what the spider “sees” and confirm it’s what you expect.

Supported options:
- `--spider SPIDER`: bypass spider autodetection and force use of specific spider
- `--no-redirect`: do not follow HTTP 3xx redirects (default is to follow them)

Usage example:
```sh
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
```

#### `settings`

Syntax: `scrapy settings [options]`

Requires project: no

Get the value of a Scrapy setting.

If used inside a project it’ll show the project setting value, otherwise it’ll show the default Scrapy value for that setting.

Example usage:
```sh
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
```

#### `runspider`

Syntax: `scrapy runspider <spider_file.py>`

Requires project: no

Run a spider self-contained in a Python file, without having to create a project.

Example usage:
```sh
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
```

#### `version`

Syntax: `scrapy version [-v]`

Requires project: no

Prints the Scrapy version. If used with `-v` it also prints Python, Twisted and Platform info, which is useful for bug reports.

#### `bench`

Syntax: `scrapy bench`

Requires project: no

Runs a quick benchmark test. See [Benchmarking](#4.12-Benchmarking) for details.

### Project-only commands

- `crawl`
- `check`
- `list`
- `edit`
- `parse`

#### `crawl`

Syntax: `scrapy crawl <spider>`

Requires project: yes

Start crawling using a spider.

Supported options:
- `-h`, `--help`: show a help message and exit
- `-a NAME=VALUE`: set a spider argument (may be repeated)
- `--output FILE` or `-o FILE`: append scraped items to the end of `FILE` (use `-` for stdout). To set the format, append a colon and the format name to the output URI (i.e. `-o FILE:FORMAT`)
- `--overwrite-output FILE` or `-O FILE`: dump scraped items into `FILE`, overwriting any existing file. To set the format, append a colon and the format name to the output URI (i.e. `-O FILE:FORMAT`)
- `--output-format FORMAT` or `-t FORMAT`: deprecated way to set the format used for dumping items; does not work in combination with `-O`

Usage examples:
```sh
$ scrapy crawl myspider
```
```
[ ... myspider starts crawling ... ]
```
```sh
$ scrapy crawl -o myfile:csv myspider
```
```
[ ... myspider starts crawling and appends the result to the file myfile in csv format ... ]
```
```sh
$ scrapy crawl -O myfile:json myspider
```
```
[ ... myspider starts crawling and saves the result in myfile in json format overwriting the original content... ]
```
```sh
$ scrapy crawl -o myfile -t csv myspider
```
```
[ ... myspider starts crawling and appends the result to the file myfile in csv format ... ]
```

#### `check`

Syntax: `scrapy check [options] <spider>`

Requires project: yes

Check spider contracts.

Options:

```
  -h, --help            show this help message and exit
  -l, --list            only list contracts, without checking them
  -v, --verbose         print contract tests for all spiders

Global Options
--------------
  --logfile FILE        log file. if omitted stderr will be used
  -L LEVEL, --loglevel LEVEL
                        log level (default: DEBUG)
  --nolog               disable logging completely
  --profile FILE        write python cProfile stats to FILE
  --pidfile FILE        write process ID to FILE
  -s NAME=VALUE, --set NAME=VALUE
                        set/override setting (may be repeated)
  --pdb                 enable pdb on failure
```

Usage examples:

```sh
$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
```

##### Spider contracts

_ChatGPT:_  
In Scrapy, spider contracts are a lightweight way to test spider callbacks. A contract is a line in the callback's docstring that starts with `@`; `scrapy check` requests a sample URL, runs the callback on the response, and verifies that the result meets every contract. The built-in contracts live in the `scrapy.contracts` package.

Spider contracts are particularly useful for catching breakage early, for example when a site changes its markup and a callback silently stops returning a field.

Here's a brief overview of how spider contracts work in Scrapy:

- **Built-in contracts:** `@url` sets the sample URL to request (callbacks without it are skipped by the checks), `@cb_kwargs` sets keyword arguments for the callback, `@returns` sets lower and upper bounds on the number of returned items or requests, and `@scrapes` lists fields that every returned item must contain.

- **Defining custom contracts:** You subclass `scrapy.contracts.Contract`, set its `name`, and override any of three hooks: `adjust_request_args(args)` to modify the arguments of the sample request, `pre_process(response)` to check the response before it reaches the callback, and `post_process(output)` to check the callback's output. A hook signals a failed check by raising `scrapy.exceptions.ContractFail`.

- **Enabling custom contracts:** You register the class in the `SPIDER_CONTRACTS` setting and then reference it by name in callback docstrings.

- **Running contract checks:** Contracts are only evaluated by `scrapy check`, not during a normal `scrapy crawl`. Failed checks are reported in the command output, as in the usage example above.

Here's a simple example of a custom contract class that verifies the presence of certain fields in the callback output:

```python
from itemadapter import ItemAdapter, is_item
from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class RequiredFieldsContract(Contract):
    """Contract to verify the presence of required fields in callback output.

    @required_fields title url content
    """

    name = "required_fields"

    def post_process(self, output):
        # self.args holds the field names listed after @required_fields
        for item in output:
            if not is_item(item):
                continue  # skip Request objects
            for field in self.args:
                if field not in ItemAdapter(item):
                    raise ContractFail(f"Required field '{field}' missing in item: {item}")
```

And here's how you would enable this contract and use it in a spider (the module path `myproject.contracts` is an assumption about where the class is saved):

```python
import scrapy

# In settings.py, register the contract (module path is an assumption):
# SPIDER_CONTRACTS = {"myproject.contracts.RequiredFieldsContract": 10}


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        """Parse the start page.

        @url http://example.com
        @returns items 1
        @required_fields title url content
        """
        yield {
            "title": response.xpath("//title/text()").get(),
            "url": response.url,
            "content": response.xpath("//body").get(),
        }
```

With this setup, `scrapy check myspider` verifies that each item returned by `MySpider.parse` contains the fields listed after `@required_fields`. If an item fails this verification, the check is reported as failed with the `ContractFail` message.

#### `list`

Syntax: `scrapy list`

Requires project: yes

List all available spiders in the current project. The output is one spider per line.

Usage example:

```sh
$ scrapy list
spider1
spider2
```

#### `edit`

Syntax: `scrapy edit <spider>`

Requires project: yes

Edit the given spider using the editor defined in the `EDITOR` environment variable or (if unset) the `EDITOR` setting.

This command is provided only as a convenience shortcut for the most common case; the developer is of course free to choose any tool or IDE to write and debug spiders.

Usage example:
```sh
$ scrapy edit spider1
```

#### `parse`

Syntax: `scrapy parse <url> [options]`

Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the `--callback` option, or `parse` if not given.

Supported options:
- `--spider=SPIDER`: bypass spider autodetection and force use of specific spider
- `-a NAME=VALUE`: set spider argument (may be repeated)
- `--callback` or `-c`: spider method to use as callback for parsing the response
- `--meta` or `-m`: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: `--meta='{"foo": "bar"}'`
- `--cbkwargs`: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: `--cbkwargs='{"foo": "bar"}'`
- `--pipelines`: process items through pipelines
- `--rules` or `-r`: use `CrawlSpider` rules to discover the callback (i.e. spider method) to use for parsing the response
- `--noitems`: don’t show scraped items
- `--nolinks`: don’t show extracted links
- `--nocolour`: avoid using pygments to colorize the output
- `--depth` or `-d`: depth level for which the requests should be followed recursively (default: 1)
- `--verbose` or `-v`: display information for each depth level
- `--output` or `-o`: dump scraped items to a file (new in version 2.3)

Usage example:
```sh
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': 'Example item',
 'category': 'Furniture',
 'length': '12 cm'}]

# Requests  -----------------------------------------------------------------
[]
```

## Custom project commands

You can also add your custom project commands by using the `COMMANDS_MODULE` setting. See the Scrapy commands in [scrapy/commands](https://github.com/scrapy/scrapy/tree/master/scrapy/commands) for examples on how to implement your commands.

### `COMMANDS_MODULE`

Default: `''` (empty string)

A module to use for looking up custom Scrapy commands. This is used to add custom commands for your Scrapy project.

Example:
```python
COMMANDS_MODULE = "mybot.commands"
```

### Register commands via `setup.py` entry points

You can also add Scrapy commands from an external library by adding a `scrapy.commands` section in the entry points of the library `setup.py` file.

The following example adds `my_command` command:

```python
from setuptools import setup, find_packages

setup(
    name="scrapy-mymodule",
    entry_points={
        "scrapy.commands": [
            "my_command=my_scrapy_module.commands:MyCommand",
        ],
    },
)
```

# 2.2 Spiders

**Spiders** are classes which define how a certain site (or a group of sites) will be scraped, including 
- how to perform the `crawl` (i.e. follow links) and 
- how to extract structured data from their pages (i.e. scraping `items`). 

In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial **Requests** to crawl the first URLs, and specify a **callback function** to be called with the `response` downloaded from those requests.

    The first requests to perform are obtained by calling the `start_requests()` method which (by default) generates `Request` for the URLs specified in the `start_urls` and the `parse` method as callback function for the Requests.

2. In the callback function, you parse the **response** (web page) and return 
    - [item objects](#2.4-Items), 
    - `Request` objects, or 
    - an iterable of these objects. 

    Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy, with their responses handled by the specified callback.

3. In callback functions, you parse the page contents, typically using [Selectors](#2.3-Selectors) (but you can also use `BeautifulSoup`, `lxml` or whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be 
    - persisted to a database (in some [Item Pipeline](#2.7-Item-Pipeline)) or 
    - written to a file using [Feed exports](#2.8-Feed-exports).

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.

## `scrapy.Spider`

```python
class scrapy.spiders.Spider
class scrapy.Spider
```

This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn’t provide any special functionality. It just provides a default `start_requests()` implementation which sends requests from the `start_urls` spider attribute and calls the spider’s method `parse` for each of the resulting responses.

### `name`

A string which defines the name for this spider. 

The spider `name` is how the spider is located (and instantiated) by Scrapy, so **it must be unique**. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls `mywebsite.com` would often be called `mywebsite`.

### `allowed_domains`

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if `OffsiteMiddleware` is enabled.

Let’s say your target url is `https://www.example.com/1.html`, then add '`example.com`' to the list.
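
The "domain or subdomain" rule can be illustrated with a small stand-alone function. This is a simplified sketch of the matching idea, not Scrapy's actual `OffsiteMiddleware` code; the helper name `is_allowed` is made up for this example.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


print(is_allowed("https://www.example.com/1.html", ["example.com"]))  # subdomain of example.com
print(is_allowed("https://notexample.com/", ["example.com"]))         # different domain
```

This is why listing `example.com` is enough to cover `www.example.com`, while a bare suffix match would wrongly accept `notexample.com`.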

### `start_urls`

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent `Request` will be generated successively from data contained in the start URLs.

### `custom_settings`

A dictionary of settings that will be overridden from the project wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.

For a list of available built-in settings see: [Built-in settings reference](#Built-in-settings-reference).

### `crawler`

This attribute is set by the `from_crawler()` class method after initializing the class, and links to the `Crawler` object to which this spider instance is bound.

Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions, middlewares, signals managers, etc). See [Crawler API](#Crawler-API) to know more about them.

### `settings`

Configuration for running this spider. This is a `Settings` instance, see the [Settings](#2.11-Settings) topic for a detailed introduction on this subject.

### `logger`

Python logger created with the Spider’s name. You can use it to send log messages as described on [Logging from Spiders](#Logging-from-Spiders).

### `state`

A `dict` you can use to persist some spider state between batches. See [Keeping persistent state between batches](#Keeping-persistent-state-between-batches) to know more about it.

### `classmethod from_crawler(crawler, *args, **kwargs)`

This is the class method used by Scrapy to create your spiders.

You probably won’t need to override this directly because the default implementation acts as a proxy to the `__init__()` method, calling it with the given arguments `args` and named arguments `kwargs`.

Nonetheless, this method sets the `crawler` and `settings` attributes in the new instance so they can be accessed later inside the spider’s code.

_Changed in version 2.11: The settings in `crawler.settings` can now be modified in this method, which is handy if you want to modify them based on arguments. As a consequence, these settings aren’t the final values as they can be modified later by e.g. [add-ons](#5.2-Add-ons). For the same reason, most of the `Crawler` attributes aren’t initialized at this point._

The final `settings` and the initialized `Crawler` attributes are available in the `start_requests()` method, handlers of the `engine_started` signal and later.

**Parameters**
- `crawler (Crawler instance)` – crawler to which the spider will be bound
- `args (list)` – arguments passed to the `__init__()` method
- `kwargs (dict)` – keyword arguments passed to the `__init__()` method

### `classmethod update_settings(settings)`

The `update_settings()` method is used to modify the spider’s settings and is called during initialization of a spider instance.

It takes a `Settings` object as a parameter and can add or update the spider’s configuration values. This method is a class method, meaning that it is called on the Spider class and allows all instances of the spider to share the same configuration.

While per-spider settings can be set in `custom_settings`, using `update_settings()` allows you to dynamically add, remove or change settings based on other settings, spider attributes or other factors, and to use setting priorities other than `'spider'`. Also, it’s easy to extend `update_settings()` in a subclass by overriding it, while doing the same with `custom_settings` can be hard.

For example, suppose a spider needs to modify `FEEDS`:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    custom_feed = {
        "/home/user/documents/items.json": {
            "format": "json",
            "indent": 4,
        }
    }

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings.setdefault("FEEDS", {}).update(cls.custom_feed)
```

### `parse(response)`

This is the default callback used by Scrapy to process downloaded `response`s, when their requests don’t specify a callback.

The `parse` method is in charge of processing the `response` and returning scraped data and/or more URLs to follow. Other `Request`s callbacks have the same requirements as the Spider class.

This method, as well as any other `Request` callback, must return 
- a `Request` object, 
- an [item object](#2.4-Items), 
- an iterable of `Request` objects and/or [item objects](#2.4-Items), or 
- `None`.

**Parameters**
- `response` (Response) – the response to parse

### `log(message[, level, component])`

Wrapper that sends a log message through the Spider’s logger, kept for backward compatibility. For more information see [Logging from Spiders](#Logging-from-Spiders).

### `closed(reason)`

Called when the spider closes. This method provides a shortcut to `signals.connect()` for the `spider_closed` signal.

Let’s see an example:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/1.html",
        "http://www.example.com/2.html",
        "http://www.example.com/3.html",
    ]

    def parse(self, response):
        self.logger.info("A response from %s just arrived!", response.url)

    # Called once when the spider closes; reason is e.g. "finished"
    def closed(self, reason):
        self.logger.info("Spider closed: %s", reason)
```

_ChatGPT:_  
In this example, the `closed(reason)` method is defined within the `MySpider` class. When the spider closes, Scrapy automatically calls this method, passing the reason for closure as an argument (`reason`).

Inside the `closed(reason)` method, you can include any cleanup logic or additional processing that needs to be performed before the spider closes. In the provided example, it logs a message indicating that the spider has closed and includes the reason for closure (`reason`).

This method provides a convenient way to perform actions when the spider is shutting down, such as closing database connections, saving final results, or logging summary information.

### More examples

Return multiple Requests and items from a single callback:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/1.html",
        "http://www.example.com/2.html",
        "http://www.example.com/3.html",
    ]

    def parse(self, response):
        for h3 in response.xpath("//h3").getall():
            yield {"title": h3}

        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
```

Instead of `start_urls` you can use `start_requests()` directly; to give data more structure you can use `Item` objects:

```python
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]

    def start_requests(self):
        yield scrapy.Request("http://www.example.com/1.html", self.parse)
        yield scrapy.Request("http://www.example.com/2.html", self.parse)
        yield scrapy.Request("http://www.example.com/3.html", self.parse)

    def parse(self, response):
        for h3 in response.xpath("//h3").getall():
            yield MyItem(title=h3)

        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
```

## Spider arguments

Spiders can receive arguments that modify their behaviour. Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the `crawl` command using the `-a` option. For example:

```sh
scrapy crawl myspider -a category=electronics
```

Spiders can access arguments in their `__init__` methods:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [f"http://www.example.com/categories/{category}"]
        # ...
```

The default `__init__` method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(f"http://www.example.com/categories/{self.category}")
```

If you are running [Scrapy from a script](#Run-Scrapy-from-a-script), you can specify spider arguments when calling `CrawlerProcess.crawl` or `CrawlerRunner.crawl`:

```python
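from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(MySpider, category="electronics")  # keyword arguments become spider arguments
process.start()  # blocks until the crawl is finished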
```

Keep in mind that spider arguments are only strings. The spider will not do any parsing on its own. If you were to set the `start_urls` attribute from the command line, you would have to parse it on your own into a list using something like `ast.literal_eval()` or `json.loads()` and then set it as an attribute. Otherwise, you would cause iteration over a `start_urls` string (a very common Python pitfall) resulting in each character being seen as a separate URL.
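
The pitfall and the fix can be shown without running a crawl. In this sketch, `parse_start_urls` is a hypothetical helper and the raw string stands in for what `-a start_urls=...` would deliver.

```python
import json


def parse_start_urls(raw):
    """Decode a JSON array passed as a spider argument into a list of URLs."""
    urls = json.loads(raw)
    if not isinstance(urls, list):
        raise ValueError("start_urls must be a JSON array of URLs")
    return urls


raw = '["http://www.example.com/1.html", "http://www.example.com/2.html"]'

# Pitfall: iterating the raw string yields single characters, not URLs.
print(list(raw)[:3])  # ['[', '"', 'h']

# Parsed, the argument becomes a proper list of two URLs.
print(parse_start_urls(raw))
```

Inside a spider, the helper could be called from `__init__` before assigning `self.start_urls`.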

A valid use case is to set the `http` auth credentials used by `HttpAuthMiddleware` or the user agent used by `UserAgentMiddleware`:

```sh
scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot
```

Spider arguments can also be passed through the `Scrapyd` `schedule.json` API. See [Scrapyd documentation](https://scrapyd.readthedocs.io/en/latest/).

## Generic Spiders

Scrapy comes with some useful generic spiders that you can use to subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from [Sitemaps](https://www.sitemaps.org/index.html), or parsing an XML/CSV feed.

For the examples used in the following spiders, we’ll assume you have a project with a `TestItem` declared in a `myproject.items` module:

```python
import scrapy


class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
```

### CrawlSpider

`class scrapy.spiders.CrawlSpider` [source](https://docs.scrapy.org/en/latest/_modules/scrapy/spiders/crawl.html#CrawlSpider)

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from `Spider` (that you must specify), this class supports a new attribute:

- `rules`

`rules` is a list of one (or more) `Rule` objects. Each `Rule` defines a certain behaviour for crawling the site. `Rule` objects are described below. If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.

This spider also exposes an overridable method:

- `parse_start_url(response, **kwargs)` [source](https://docs.scrapy.org/en/latest/_modules/scrapy/spiders/crawl.html#CrawlSpider.parse_start_url)

This method is called for each `response` produced for the URLs in the spider’s `start_urls` attribute. It allows parsing the initial responses and must return either 
- an `item` object, 
- a `Request` object, or 
- an `iterable` containing any of them.

#### Crawling rules

```python
class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)
```
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/spiders/crawl.html#Rule)

- `link_extractor` is a [Link Extractor](#2.10-Link-Extractors) object which defines how links will be extracted from each crawled page. Each produced link will be used to generate a `Request` object, which will contain the link’s text in its meta dictionary (under the `link_text` key). If omitted, a default link extractor created with no arguments will be used, resulting in all links being extracted.

- `callback` is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor. This callback receives a `Response` as its first argument and must return either a single instance or an iterable of [item objects](#2.4-Items) and/or `Request` objects (or any subclass of them). As mentioned above, the received `Response` object will contain the text of the link that produced the `Request` in its meta dictionary (under the `link_text` key).

- `cb_kwargs` is a dict containing the keyword arguments to be passed to the callback function.

- `follow` is a boolean which specifies if links should be followed from each `response` extracted with this rule. If `callback` is `None`, `follow` defaults to `True`; otherwise it defaults to `False`.

- `process_links` is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified `link_extractor`. This is mainly used for filtering purposes.

- `process_request` is a callable (or a string, in which case a method from the spider object with that name will be used) which will be called for every `Request` extracted by this rule. This callable should take said request as first argument and the `Response` from which the request originated as second argument. It must return a `Request` object or `None` (to filter out the request).

- `errback` is a callable or a string (in which case a method from the spider object with that name will be used) to be called if any exception is raised while processing a request generated by the rule. It receives a Twisted `Failure` instance as its first parameter.

> <b>Warning</b>: Because of its internal implementation, you must explicitly set callbacks for new requests when writing CrawlSpider-based spiders; unexpected behaviour can occur otherwise.

_New in version 2.0: The `errback` parameter._

#### CrawlSpider example

Let’s now take a look at an example `CrawlSpider` with rules:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r"category\.php",), deny=(r"subsection\.php",))),
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r"item\.php",)), callback="parse_item"),
    )

    def parse_item(self, response):
        self.logger.info("Hi, this is an item page! %s", response.url)
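        # A plain dict serves as the item here: a bare scrapy.Item() declares
        # no fields, so assigning item["id"] on it would raise KeyError.
        item = {}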
        item["id"] = response.xpath('//td[@id="item_id"]/text()').re(r"ID: (\d+)")
        item["name"] = response.xpath('//td[@id="item_name"]/text()').get()
        item["description"] = response.xpath(
            '//td[@id="item_description"]/text()'
        ).get()
        item["link_text"] = response.meta["link_text"]
        url = response.xpath('//td[@id="additional_data"]/@href').get()
        return response.follow(
            url, self.parse_additional_page, cb_kwargs=dict(item=item)
        )

    def parse_additional_page(self, response, item):
        item["additional_data"] = response.xpath(
            '//p[@id="additional_data"]/text()'
        ).get()
        return item
```

This spider would start crawling `example.com`’s home page, collecting category links, and item links, parsing the latter with the `parse_item` method. For each item response, some data will be extracted from the HTML using XPath, and an `Item` will be filled with it.

# 2.3 Selectors

## Selecting element attributes

# 2.4 Items

The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Spiders may return the extracted data as **items**, Python objects that define **key-value pairs**.

Scrapy supports multiple types of items. When you create an item, you may use whichever type of item you want. When you write code that receives an item, your code should work for any item type.

## Item Types

Scrapy supports the following types of items, via the [itemadapter library](https://github.com/scrapy/itemadapter): 
- dictionaries, 
- `Item` objects, 
- `dataclass` objects, and 
- `attrs` objects.

_ChatGPT:_  
Let's discuss the best practices for choosing among these options:

- **Dictionaries**: They are the most basic and widely used option. They are suitable for simple scraping tasks where the data structure is straightforward and doesn't require much validation or additional functionality. They are also more lightweight compared to defining custom classes.

- **Item objects**: These are recommended for more complex scraping tasks where you need a fixed set of declared fields (assigning an undeclared field raises `KeyError`), per-field metadata, or custom methods on your items. Note that `Item` does not validate field values. If you prefer a more object-oriented approach and want to leverage features like inheritance, defining `Item` classes is a good choice.

- **Dataclass objects** and **Attrs objects**: These options provide similar benefits, offering a more concise syntax for defining classes with attributes compared to traditional classes. They are suitable when you want a more structured approach compared to dictionaries but prefer a lightweight alternative to defining full-fledged Item classes. Choosing between dataclasses and attrs largely depends on your familiarity with each approach and any specific features offered by either library that you find beneficial.

In summary, the best practice for choosing the type of items in Scrapy depends on the complexity of your scraping task, your preference for syntax and features, and your familiarity with the available options. For simpler tasks, dictionaries may suffice, while for more complex tasks, defining custom Item classes, dataclasses, or attrs classes may be more appropriate.

### Dictionaries

As an item type, `dict` is convenient and familiar.

### `Item` objects

`Item` provides a `dict`-like API plus additional features that make it the most feature-complete item type:

```python
class scrapy.item.Item([arg])
class scrapy.Item([arg])
```

- `Item` objects replicate the standard `dict` API, including its `__init__` method.
- `Item` allows defining field names, so that:
    - `KeyError` is raised when using undefined field names (i.e. prevents typos going unnoticed)
    - [Item exporters](#5.8-Item-Exporters) can export all fields by default even if the first scraped object does not have values for all of them
- `Item` also allows defining field metadata, which can be used to customize [serialization](#Serialization-of-item-fields).
- `trackref` tracks `Item` objects to help find memory leaks (see [Debugging memory leaks with trackref](#Debugging-memory-leaks-with-trackref)).
- `Item` objects also provide the following additional API members:
    - `Item.copy()`: Return a shallow copy of this item.
    - `Item.deepcopy()`: Return a `deepcopy()` of this item.
    - `fields`: A dictionary containing all declared fields for this `Item`, not only those populated. The keys are the field names and the values are the `Field` objects used in the Item declaration.

Example:

```python
from scrapy.item import Item, Field


class CustomItem(Item):
    one_field = Field()
    another_field = Field()
```

_ChatGPT:_  
In the context of Scrapy, `Field()` objects are used within `scrapy.Item` subclass definitions. `Field()` is not a datatype in the traditional sense. It's a class provided by Scrapy for declaring the fields of an item and attaching metadata to them. It does not validate or convert the scraped values; components such as item exporters and Item Loaders read the metadata and act on it.

When you define an `Item` subclass in Scrapy, you're essentially creating a structured container to hold the data you want to scrape from web pages. Each piece of data you want to extract from a web page corresponds to a field in your Item subclass.

Because `Field()` carries no type information, a field can hold any Python value encountered during web scraping. Converting each scraped value to the type you intend (for example in an input processor or an item pipeline) is up to you.

Here's a basic example of a Scrapy Item subclass whose fields are intended to hold different data types (the comments are documentation only and are not enforced):

```python
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()             # String
    price = scrapy.Field()            # Float
    in_stock = scrapy.Field()         # Boolean
    tags = scrapy.Field()             # List
    attributes = scrapy.Field()       # Dictionary
    release_date = scrapy.Field()     # Date
```

### `Dataclass` objects

_New in version 2.2._  
[`dataclass()`](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) allows defining item classes with field names, so that [item exporters](#5.8-Item-Exporters) can export all fields by default even if the first scraped object does not have values for all of them.

Additionally, `dataclass` items also allow you to:
- define the type and default value of each defined field.
- define custom field metadata through `dataclasses.field()`, which can be used to [customize serialization](#Serialization-of-item-fields).

Example:

```python
from dataclasses import dataclass


@dataclass
class CustomItem:
    one_field: str
    another_field: int
```

> _Note:_ Field types are not enforced at run time.
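
As a small illustration of the two points above, the following sketch uses only the standard library. It attaches metadata with `dataclasses.field()` and shows that the declared types are not checked at run time. The `serializer` key mirrors the metadata key used elsewhere in these notes; the default value and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class CustomItem:
    one_field: str
    # Metadata is a free-form mapping; "serializer" is the key used for
    # customizing serialization in the Item examples of these notes.
    another_field: int = field(default=0, metadata={"serializer": str})


# Type hints are not checked: a str is accepted where int is declared.
item = CustomItem(one_field="a", another_field="not an int")

# Metadata is read back through dataclasses.fields().
metadata = {f.name: dict(f.metadata) for f in fields(CustomItem)}
```

Here `metadata` maps `one_field` to an empty dict and `another_field` to its `serializer` entry.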

### `attr.s` objects

_New in version 2.2._  
[`attr.s()`](https://www.attrs.org/en/stable/api-attr.html#attr.s) allows defining item classes with field names, so that [item exporters](#5.8-Item-Exporters) can export all fields by default even if the first scraped object does not have values for all of them.

Additionally, `attr.s` items also allow you to:
- define the type and default value of each defined field.
- define custom field [metadata](https://www.attrs.org/en/stable/examples.html#metadata), which can be used to [customize serialization](#Serialization-of-item-fields).

In order to use this type, the [attrs package](https://www.attrs.org/en/stable/index.html) needs to be installed.

Example:

```python
import attr


@attr.s
class CustomItem:
    one_field = attr.ib()
    another_field = attr.ib()
```

## Working with `Item` objects

### Declaring `Item` subclasses

`Item` subclasses are declared using a simple `class` definition syntax and `Field` objects. Here is an example:

```python
import scrapy


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
```

> _Note:_ Those familiar with Django will notice that Scrapy `Items` are declared similarly to [Django Models](https://docs.djangoproject.com/en/dev/topics/db/models/), except that Scrapy `Items` are much simpler as there is no concept of different field types.

### Declaring `fields`

`Field` objects are used to specify metadata for each field, for example the `serializer` function declared for the `last_updated` field in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by `Field` objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in `Field` objects could be used by a different component, and only those components know about it. You can also define and use any other `Field` key in your project too, for your own needs. 

> The main goal of `Field` objects is to provide a way to define all `field` metadata in one place. 

Typically, those components whose behaviour depends on each field use certain field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by each component.

It’s important to note that

> the `Field` objects used to declare the `item` do not stay assigned as class attributes. Instead, they can be accessed through the `Item.fields` attribute.

```python
class scrapy.item.Field([arg])
class scrapy.Field([arg])
```

The `Field` class is just an alias to the built-in `dict` class and doesn’t provide any extra functionality or attributes. In other words, `Field` objects are plain-old Python dicts. A separate class is used to support the `item` declaration syntax based on class attributes.

> _Note:_ `Field` metadata can also be declared for `dataclass` and `attrs` items. Please refer to the documentation for [`dataclasses.field`](https://docs.python.org/3/library/dataclasses.html#dataclasses.field) and [`attr.ib`](https://www.attrs.org/en/stable/api.html#attr.ib) for additional information.

***
_ChatGPT:_  
In Scrapy, the components the authors are referring to are likely the various parts of the framework that utilize the metadata associated with `Field` objects.

Here's a breakdown:

- **`Field` objects**: In Scrapy, `Field` objects are used to define the fields (or attributes) of items that will be scraped from web pages. These fields can represent various types of data such as strings, integers, dates, etc. Each `field` can have associated metadata, which provides additional information or instructions about how the `field` should be processed or handled.

- **Components**: Scrapy is composed of several components that work together to facilitate the scraping process. These components include spiders, pipelines, middlewares, and extensions, among others. Each of these components may utilize the metadata associated with `Field` objects in different ways to customize the scraping behavior or perform additional processing on the scraped data.

- **Metadata keys**: Metadata keys are the names under which extra information is stored on a `Field` object. Keys that Scrapy components read include `serializer` (item exporters) and `input_processor` / `output_processor` (Item Loaders). Any other key, such as a project-specific `xpath` or `css` key, is ignored unless your own code reads it.

- **Customization**: The authors are emphasizing that Scrapy allows for flexibility and customization by allowing users to define and use any metadata keys they need for their specific project requirements. This means that users can define their own metadata keys and use them within their spiders, pipelines, or other components to customize the scraping process according to their needs.

Overall, the authors are highlighting the flexibility and extensibility of Scrapy, allowing users to define and use metadata keys as needed across various components of their scraping projects.
***

Here are some examples of common tasks performed with items, using the `Product` item declared above. You will notice the API is very similar to the dict API.

```python
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
```

### Creating `item`s

```python
product = Product(name="Desktop PC", price=1000)
print(product)
```
```
Product(name='Desktop PC', price=1000)
```

### Getting `field` values

```ipython
>>> product["name"]
'Desktop PC'

>>> product.get("name")
'Desktop PC'

>>> product["price"]
1000

>>> product["last_updated"]
Traceback (most recent call last):
    ...
KeyError: 'last_updated'

>>> product.get("last_updated", "not set")
'not set'

>>> product["lala"]  # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'

>>> product.get("lala", "unknown field")
'unknown field'

>>> "name" in product  # is name field populated?
True

>>> "last_updated" in product  # is last_updated populated?
False

>>> "last_updated" in product.fields  # is last_updated a declared field?
True

>>> "lala" in product.fields  # is lala a declared field?
False
```

### Setting `field` values

```ipython
>>> product["last_updated"] = "today"
>>> product["last_updated"]
'today'

>>> product["lala"] = "test"  # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'
```

### Accessing all populated values

To access all populated values, just use the typical dict API:

```ipython
>>> product.keys()
['price', 'name']

>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
```

### Copying `item`s

To copy an item, you must first decide whether you want a **shallow copy** (by reference) or a **deep copy** (by value).

If your item contains mutable values like lists or dictionaries, a _shallow copy will keep references_ to the same mutable values across all different copies.

For example, if you have an item with a list of tags, and you create a shallow copy of that item, both the original item and the copy have the same list of tags. Adding a tag to the list of one of the items will add the tag to the other item as well.

If that is not the desired behavior, use a deep copy instead.

See [`copy`](https://docs.python.org/3/library/copy.html#module-copy) for more information.

To create a shallow copy of an item, you can either call `copy()` on an existing item (`product2 = product.copy()`) or instantiate your `item` class from an existing item (`product2 = Product(product)`).

To create a deep copy, call `deepcopy()` instead (`product2 = product.deepcopy()`).
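
The difference can be seen with a plain `dict`, which is itself a supported item type. The sketch below uses only the standard library; `dict.copy()` and `copy.deepcopy()` stand in for `Item.copy()` and `Item.deepcopy()`, and the sample product is made up for the illustration.

```python
import copy

# A plain dict is a valid item type, so it is enough to show the difference.
product = {"name": "Desktop PC", "tags": ["computers"]}

shallow = product.copy()        # same idea as Item.copy()
deep = copy.deepcopy(product)   # same idea as Item.deepcopy()

# Mutate the list held by the original item.
product["tags"].append("sale")

# The shallow copy shares the list with the original; the deep copy does not.
shallow_tags = shallow["tags"]  # ['computers', 'sale']
deep_tags = deep["tags"]        # ['computers']
```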

### Other common tasks

Creating dicts from items:

```ipython
>>> dict(product)  # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
```

Creating items from dicts:
```ipython
>>> Product({"name": "Laptop PC", "price": 1500})
Product(price=1500, name='Laptop PC')

>>> Product({"name": "Laptop PC", "lala": 1500})  # warning: unknown field in dict
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'
```

### Extending `Item` subclasses

You can extend `Item`s (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

For example:
```python
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()
```

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:
```python
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields["name"], serializer=my_serializer)
```

That adds (or replaces) the serializer metadata key for the `name` field, keeping all the previously existing metadata values.
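
Since `Field` objects behave like plain dicts, the merge can be sketched with dicts alone. In the sketch below, `base_name_field`, its `required` key, and `my_serializer` are hypothetical and exist only for the illustration; the last line has the same call shape as `scrapy.Field(Product.fields["name"], serializer=my_serializer)`.

```python
# Plain dicts stand in for Field objects here.
# "required" is a made-up metadata key, used only to show that it is preserved.
base_name_field = {"serializer": str, "required": True}


def my_serializer(value):
    """Hypothetical serializer used only for this illustration."""
    return str(value).upper()


# Copy the existing metadata, then add or replace the "serializer" key.
extended_name_field = dict(base_name_field, serializer=my_serializer)
```

The original metadata dict is left unchanged; only the new one carries the replaced `serializer`.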

## Supporting All `Item` Types

In code that receives an `item`, such as methods of [item pipelines](#2.7-Item-Pipeline) or [spider middlewares](#5.4-Spider-Middleware), it is a good practice to use the `ItemAdapter` class and the `is_item()` function to write code that works for any supported item type:

- `class itemadapter.ItemAdapter(item: Any)` [source](https://docs.scrapy.org/en/latest/_modules/itemadapter/adapter.html#ItemAdapter): 
    - Wrapper class to interact with data container objects. It provides a common interface to extract and set data without having to take the object’s type into account.
- `itemadapter.is_item(obj: Any)→ bool` [source](https://docs.scrapy.org/en/latest/_modules/itemadapter/utils.html#is_item): 
    - Return `True` if the given object belongs to one of the supported types, `False` otherwise. Alias for `ItemAdapter.is_item`

- `class scrapy.item.ItemMeta(class_name, bases, attrs)` [source](https://docs.scrapy.org/en/latest/_modules/scrapy/item.html#ItemMeta): 
    - [Metaclass](https://realpython.com/python-metaclasses) of `Item` that handles field definitions.

# 2.5 Item Loaders

**Item Loaders** provide a convenient mechanism for populating scraped items. Even though items can be populated directly, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, 

> items provide the _container_ of scraped data, while Item Loaders provide the mechanism for _populating_ that container.

Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different `field` parsing rules, either by spider, or by source format (`HTML`, `XML`, etc) without becoming a nightmare to maintain.

> _Note:_ Item Loaders are an extension of the [itemloaders library](https://itemloaders.readthedocs.io/en/latest/) that make it easier to work with Scrapy by adding support for [responses](#2.9-Requests-and-Responses).

## Using Item Loaders to populate items

To use an Item Loader, you must first instantiate it. You can either instantiate it with an `item` object or without one, in which case an `item` object is automatically created in the Item Loader `__init__` method using the `item` class specified in the `ItemLoader.default_item_class` attribute.

Then, you start collecting values into the Item Loader, typically using [Selectors](#2.3-Selectors). You can add more than one value to the same item `field`; the Item Loader will know how to “join” those values later using a proper processing function.

> _Note:_ Collected data is internally stored as `list`s, which allows adding several values to the same `field`. If an item argument is passed when creating a loader, each of the item’s values will be stored as-is if it’s already an iterable, or wrapped with a list if it’s a single value.

Here is a typical Item Loader usage in a Spider, using the `Product` item declared in the Items chapter:

```python
import scrapy


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
```

```python
from scrapy.loader import ItemLoader
from myproject.items import Product


def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath("name", '//div[@class="product_name"]')
    l.add_xpath("name", '//div[@class="product_title"]')
    l.add_xpath("price", '//p[@id="price"]')
    l.add_css("stock", "p#stock")
    l.add_value("last_updated", "today")  # you can also use literal values
    return l.load_item()
```

By quickly looking at that code, we can see the `name` field is being extracted from two different XPath locations in the page:

```html
//div[@class="product_name"]

//div[@class="product_title"]
```

In other words, data is being collected by extracting it from two XPath locations, using the `add_xpath()` method. This is the data that will be assigned to the `name` field later.

Afterwards, similar calls are used for the `price` and `stock` fields (the latter using a CSS selector with the `add_css()` method), and finally the `last_updated` field is populated directly with a literal value (`today`) using a different method: `add_value()`.

Finally, when all data is collected, the `ItemLoader.load_item()` method is called which actually returns the `item` populated with the data previously extracted and collected with the `add_xpath()`, `add_css()`, and `add_value()` calls.

## Working with `dataclass` items

By default, `dataclass` items require all fields to be passed when created. This could be an issue when using `dataclass` items with item loaders: unless a pre-populated item is passed to the loader, fields will be populated incrementally using the loader’s `add_xpath()`, `add_css()` and `add_value()` methods.

One approach to overcome this is to define items using the `field()` function, with a default argument:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)
```

## Input and Output processors

An Item Loader contains one **input processor** and one **output processor** for each (`item`) `field`. 

The input processor processes the extracted data as soon as it’s received (through the `add_xpath()`, `add_css()` or `add_value()` methods) and the result of the input processor is collected and kept inside the `ItemLoader`. 

After collecting all data, the `ItemLoader.load_item()` method is called to populate and get the populated `item` object. That’s when the output processor is called with the data previously collected (and processed using the input processor). The result of the output processor is the final value that gets assigned to the `item`.

Let’s see an example to illustrate how the input and output processors are called for a particular `field` (the same applies for any other field):

```python
l = ItemLoader(Product(), response)
l.add_xpath("name", xpath1)  # (1)
l.add_xpath("name", xpath2)  # (2)
l.add_css("name", css)  # (3)
l.add_value("name", "test")  # (4)
return l.load_item()  # (5)
```

So what happens is:
1. Data from `xpath1` is extracted, and passed through the input processor of the `name` field. The result of the input processor is collected and kept in the Item Loader (but not yet assigned to the item).
1. Data from `xpath2` is extracted, and passed through the same input processor used in (`1`). The result of the input processor is appended to the data collected in (`1`) (if any).
1. This case is similar to the previous ones, except that the data is extracted from the `css` CSS selector, and passed through the same input processor used in (`1`) and (`2`). The result of the input processor is appended to the data collected in (`1`) and (`2`) (if any).
1. This case is also similar to the previous ones, except that the value to be collected is assigned directly, instead of being extracted from an XPath expression or a CSS selector. However, the value is still passed through the input processor. In this case, since the value is not iterable, it is converted to an iterable of a single element before being passed to the input processor, because input processors always receive iterables.
1. The data collected in steps (`1`), (`2`), (`3`) and (`4`) is passed through the output processor of the `name` field. The result of the output processor is the value assigned to the `name` field in the `item`.
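
The five steps can be mimicked in plain Python. The sketch below is a hypothetical stand-in for a single field (`TinyLoader` is not part of Scrapy or `itemloaders`); it only shows that the input processor runs on every added value while the output processor runs once at the end.

```python
class TinyLoader:
    """Hypothetical stand-in that mimics the collect-then-output flow for one field."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self.collected = []  # values gathered so far for the field

    def add_value(self, value):
        # Single values are wrapped so the input processor always gets an iterable.
        if isinstance(value, str) or not hasattr(value, "__iter__"):
            value = [value]
        # The input processor runs immediately; its result is kept, not assigned.
        self.collected.extend(self.input_processor(value))

    def load(self):
        # The output processor runs once, over everything collected.
        return self.output_processor(self.collected)


loader = TinyLoader(
    input_processor=lambda values: [v.strip() for v in values],
    output_processor=" ".join,
)
loader.add_value(["  Desktop "])  # like steps (1)-(3): extracted data
loader.add_value("PC")            # like step (4): a literal value
name = loader.load()              # like step (5): "Desktop PC"
```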

It’s worth noticing that processors are just callable objects, which are called with the data to be parsed, and return a parsed value. So you can use any function as input or output processor. The only requirement is that they must accept one (and only one) positional argument, which will be an iterable.

_Changed in version 2.0: Processors no longer need to be methods._

> _Note:_ Both input and output processors must receive an iterable as their first argument. The output of those functions can be anything. The result of input processors will be appended to an internal list (in the `Loader`) containing the collected values (for that `field`). The result of the output processors is the value that will be finally assigned to the `item`.

The other thing you need to keep in mind is that the values returned by input processors are collected internally (in lists) and then passed to output processors to populate the fields.

Last, but not least, itemloaders come with some [commonly used processors](https://itemloaders.readthedocs.io/en/latest/built-in-processors.html#built-in-processors) built-in for convenience.

### Available [built-in processors](https://itemloaders.readthedocs.io/en/latest/built-in-processors.html#built-in-processors)

Even though you can use any callable function as input and output processors, `itemloaders` provides some commonly used processors, which are described below.

Some of them, like the `MapCompose` (which is typically used as input processor) compose the output of several functions executed in order, to produce the final parsed value.

#### `Compose`

`class itemloaders.processors.Compose(*functions, **default_loader_context)` [source](https://itemloaders.readthedocs.io/en/latest/_modules/itemloaders/processors.html#Compose)

A processor which is constructed from the composition of the given functions. This means that each input value of this processor is passed to the first function, and the result of that function is passed to the second function, and so on, until the last function returns the output value of this processor.

By default, processing stops at the first `None` value. This behaviour can be changed by passing the keyword argument `stop_on_none=False`.

Example:

```ipython
>>> from itemloaders.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'
```

Each function can optionally receive a [`loader_context`](https://itemloaders.readthedocs.io/en/latest/loaders-context.html#loaders-context) parameter. For those which do, this processor will pass the currently active `Loader` context through that parameter.

The keyword arguments passed in the `__init__` method are used as the default `Loader` context values passed to each function call. However, the final `Loader` context values passed to functions are overridden with the currently active `Loader` context accessible through the `ItemLoader.context` attribute.
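
The chaining and the `stop_on_none` behaviour described above can be approximated in a few lines of plain Python. This is a simplified sketch, not the library's code: the function name `compose` is made up, and Loader context handling is left out.

```python
def compose(*functions, stop_on_none=True):
    """Plain-Python sketch of the chaining behaviour described for Compose."""

    def processor(value):
        for func in functions:
            if value is None and stop_on_none:
                break  # stop the chain once a step has produced None
            value = func(value)
        return value

    return processor


# Same chain as the Compose example above.
first_upper = compose(lambda v: v[0], str.upper)
upper_result = first_upper(["hello", "world"])  # 'HELLO'

# The first step returns None, so str.upper is never called.
stops_early = compose(lambda v: None, str.upper)
early_result = stops_early(["hello"])           # None
```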

#### `Identity`

`class itemloaders.processors.Identity` [source](https://itemloaders.readthedocs.io/en/latest/_modules/itemloaders/processors.html#Identity)

The simplest processor, which doesn’t do anything. It returns the original values unchanged. It doesn’t receive any `__init__` method arguments, nor does it accept Loader contexts.

Example:

```ipython
>>> from itemloaders.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']
```

#### `Join`

`class itemloaders.processors.Join(separator=' ')` [source](https://itemloaders.readthedocs.io/en/latest/_modules/itemloaders/processors.html#Join)

Returns the values joined with the separator given in the `__init__` method, which defaults to `' '`. It doesn’t accept **Loader contexts**.

When using the default separator, this processor is equivalent to the function: `' '.join`.

Examples:

```ipython
>>> from itemloaders.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
'one<br>two<br>three'
```

#### `MapCompose`

`class itemloaders.processors.MapCompose(*functions, **default_loader_context)` [source](https://itemloaders.readthedocs.io/en/latest/_modules/itemloaders/processors.html#MapCompose)

A processor which is constructed from the composition of the given functions, similar to the `Compose` processor. The difference with this processor is the way internal results are passed among functions, which is as follows:

The input value of this processor is iterated and the first function is applied to each element. The results of these function calls (one for each element) are concatenated to construct a new iterable, which is then used to apply the second function, and so on, until the last function is applied to each value of the list of values collected so far. The output values of the last function are concatenated together to produce the output of this processor.

Each particular function can return a value or a list of values, which is flattened with the list of values returned by the same function applied to the other input values. The functions can also return `None` in which case the output of that function is ignored for further processing over the chain.

This processor provides a convenient way to compose functions that only work with single values (instead of iterables). For this reason the `MapCompose` processor is typically used as input processor, since data is often extracted using the `extract()` method of [parsel selectors](https://parsel.readthedocs.io/en/latest/parsel.html#parsel.selector.Selector.extract), which returns a list of unicode strings.

The example below should clarify how it works:

```ipython
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from itemloaders.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'something'])
['HELLO', 'THIS', 'IS', 'SOMETHING']
```

As with the `Compose` processor, functions can receive `Loader` contexts, and `__init__` method keyword arguments are used as default context values. See [`Compose`](#Compose) processor for more info.
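
The per-element application and flattening described in this section can be approximated as follows. This is a simplified plain-Python sketch under the assumption that only lists and tuples are flattened; `map_compose` is a made-up name, and Loader context handling is left out.

```python
def map_compose(*functions):
    """Plain-Python sketch of the per-element chaining described for MapCompose."""

    def processor(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # None results are dropped from the chain
                # A returned list is flattened into the running list of values.
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)
                else:
                    next_values.append(result)
            values = next_values
        return values

    return processor


# str.split returns a list per element; the lists are flattened before str.upper runs.
split_and_upper = map_compose(str.split, str.upper)
words = split_and_upper(["hello world", "this is"])


def filter_world(x):
    return None if x == "world" else x


# Same chain as the MapCompose example above.
filtered = map_compose(filter_world, str.upper)(["hello", "world", "this"])
```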

#### `SelectJmes`

`class itemloaders.processors.SelectJmes(json_path)` [source](https://itemloaders.readthedocs.io/en/latest/_modules/itemloaders/processors.html#SelectJmes)

Query the input for the `jmespath` expression given at instantiation and return the answer. Requires [jmespath](https://github.com/jmespath/jmespath).

> _Note:_ `SelectJmes` accepts only one input element at a time.

Example:

```ipython
>>> from itemloaders.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("foo") #for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}
```

Working with JSON:

```ipython
>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("foo"))
>>> proc_single_json_str('{"foo": "bar"}')
'bar'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
>>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
['bar']
```

#### `TakeFirst`

`class itemloaders.processors.TakeFirst` [source](https://itemloaders.readthedocs.io/en/latest/_modules/itemloaders/processors.html#TakeFirst)

Returns the first non-null/non-empty value from the values received, so it’s typically used as an output processor to single-valued fields. It doesn’t receive any `__init__` method arguments, nor does it accept Loader contexts.

Example:

```ipython
>>> from itemloaders.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
```
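
The selection rule can be written out in plain Python. The sketch below is an approximation with a made-up function name (`take_first`); it assumes that only `None` and the empty string count as "null/empty", so other falsy values such as `0` are returned.

```python
def take_first(values):
    """Plain-Python sketch: return the first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None  # nothing usable was collected


first_name = take_first(["", None, "one", "two"])  # 'one'
first_number = take_first([0, 5])                   # 0: only None and '' are skipped here
nothing = take_first([])                            # None
```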

## Declaring Item Loaders

Item Loaders are declared using a class definition syntax. Here is an example:

```python
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader


class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(str.title)
    name_out = Join()

    price_in = MapCompose(str.strip)

    # ...
```

As you can see, input processors are declared using the `_in` suffix while output processors are declared using the `_out` suffix. You can also declare default input/output processors using the `ItemLoader.default_input_processor` and `ItemLoader.default_output_processor` attributes.

_ChatGPT:_  
`*_in` and `*_out` are special attribute names recognized by Scrapy's `ItemLoader` class for defining per-field input and output processors, respectively.

- `*_in`: Specifies how each value added for the field is processed before it is collected inside the loader.
- `*_out`: Specifies how the collected values are processed when `load_item()` is called, just before the result is assigned to the item field.

These attributes allow you to define custom processing logic for input and output data, providing flexibility in how data is handled during the loading process.

Input processors (`*_in`) are applied to the incoming data (e.g., values extracted by XPath expressions) as soon as it is added. This allows you to clean or transform the raw data as needed.

Output processors (`*_out`) are applied to everything collected for a field when the item is loaded. This allows you to reduce the collected list to the final value, for example by joining it or taking its first element.

By using input and output processors, you can customize how data is processed during the loading process, ensuring that it meets your specific requirements and standards. This enhances the flexibility and power of Scrapy's item loading mechanism.

## Declaring Input and Output Processors

As seen in the previous section, input and output processors can be declared in the Item Loader definition, and it’s very common to declare input processors this way. However, there is one more place where you can specify the input and output processors to use: in the `Item` `Field` metadata. Here is an example:

```python
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags


def filter_price(value):
    if value.isdigit():
        return value


class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
```

```ipython
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value("name", ["Welcome to my", "<strong>website</strong>"])
>>> il.add_value("price", ["&euro;", "<span>1000</span>"])
>>> il.load_item()
{'name': 'Welcome to my website', 'price': '1000'}
```

The precedence order, for both input and output processors, is as follows:
1. Item Loader field-specific attributes: `field_in` and `field_out` (most precedence)
1. Field metadata (`input_processor` and `output_processor` keys)
1. Item Loader defaults: `ItemLoader.default_input_processor()` and `ItemLoader.default_output_processor()` (least precedence)

See also: [Reusing and extending Item Loaders](#Reusing-and-extending-Item-Loaders).
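
The precedence order above can be read as a three-step lookup. The function below is a simplified model of that lookup written for these notes, not code taken from Scrapy; `resolve_processor` and its arguments are hypothetical, and plain strings stand in for real processor objects.

```python
def resolve_processor(loader_attrs, field_meta, default, field, kind):
    """Pick the processor for `field`; `kind` is "in" or "out"."""
    # 1. A loader attribute such as name_in / name_out takes precedence.
    attr = f"{field}_{kind}"
    if attr in loader_attrs:
        return loader_attrs[attr]
    # 2. Otherwise fall back to the Field metadata key.
    meta_key = "input_processor" if kind == "in" else "output_processor"
    if meta_key in field_meta:
        return field_meta[meta_key]
    # 3. Finally use the loader-wide default.
    return default


loader_attrs = {"name_in": "loader-level name_in"}
field_meta = {"input_processor": "field metadata", "output_processor": "field metadata"}

print(resolve_processor(loader_attrs, field_meta, "default", "name", "in"))   # loader-level name_in
print(resolve_processor(loader_attrs, field_meta, "default", "name", "out"))  # field metadata
print(resolve_processor(loader_attrs, {}, "default", "price", "out"))         # default
```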

## Item Loader Context

The **Item Loader Context** is a `dict` of arbitrary key/values which is shared among all input and output processors in the **Item Loader**. It can be passed when declaring, instantiating or using an Item Loader, and it is used to modify the behaviour of the input/output processors.

For example, suppose you have a function `parse_length` which receives a text value and extracts a length from it:

```python
def parse_length(text, loader_context):
    unit = loader_context.get("unit", "m")
    # ... length parsing code goes here ...
    return parsed_length
```

By accepting a `loader_context` argument the function is explicitly telling the Item Loader that it’s able to receive an Item Loader context, so the Item Loader passes the currently active context when calling it, and the processor function (`parse_length` in this case) can thus use it.

There are several ways to modify Item Loader context values:

- By modifying the currently active Item Loader context (`context` attribute):

```python
loader = ItemLoader(product)
loader.context["unit"] = "cm"
```

- On Item Loader instantiation (the keyword arguments of Item Loader `__init__` method are stored in the Item Loader context):

```python
loader = ItemLoader(product, unit="cm")
```

- On Item Loader declaration, for those input/output processors that support instantiating them with an Item Loader context. `MapCompose` is one of them:

```python
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit="cm")
```
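
To make the effect of the context concrete, here is one possible way to fill in `parse_length`. The regular expression and the `UNITS_PER_METRE` table are assumptions made for this sketch, and the function is called directly with plain dictionaries in place of a real loader context.

```python
import re

# Hypothetical conversion table: how many of each unit make up one metre.
UNITS_PER_METRE = {"m": 1, "cm": 100, "mm": 1000}


def parse_length(text, loader_context):
    # The unit comes from the loader context and defaults to metres.
    unit = loader_context.get("unit", "m")
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None  # nothing in the text looks like a number
    # Normalise the parsed number to metres.
    return float(match.group()) / UNITS_PER_METRE[unit]


print(parse_length("height: 2", {}))                # 2.0
print(parse_length("height: 150", {"unit": "cm"}))  # 1.5
```

The same input text yields a different result depending only on the context, which is why the context is a convenient place for per-site or per-loader options.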

## `ItemLoader` objects

```python
class scrapy.loader.ItemLoader(item=None, selector=None, response=None, parent=None, **context)
``` 
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/loader.html#ItemLoader)

A user-friendly abstraction to populate an item with data by applying field processors to scraped data. When instantiated with a `selector` or a `response` it supports data extraction from web pages using selectors.

**Parameters**
- `item (scrapy.item.Item)` – The item instance to populate using subsequent calls to 
    - `add_xpath()`, 
    - `add_css()`, or 
    - `add_value()`.
- `selector (Selector object)` – The selector to extract data from, when using the 
    - `add_xpath()`, 
    - `add_css()`, 
    - `replace_xpath()`, or 
    - `replace_css()` method.
- `response (Response object)` – The response used to construct the selector using the `default_selector_class`, unless the selector argument is given, in which case this argument is ignored.

If no `item` is given, one is instantiated automatically using the class in `default_item_class`.

The `item`, `selector`, `response` and remaining keyword arguments are assigned to the Loader context (accessible through the `context` attribute).

### `item`

The item object being parsed by this Item Loader. This is mostly used as a property so, when attempting to override this value, you may want to check out `default_item_class` first.

### `context`

The currently active `Context` of this Item Loader.

### `default_item_class`

An `item` class (or factory), used to instantiate items when not given in the `__init__` method.

### `default_input_processor`

The default input processor to use for those fields which don’t specify one.

### `default_output_processor`

The default output processor to use for those fields which don’t specify one.

### `default_selector_class`

The class used to construct the selector of this `ItemLoader`, if only a `response` is given in the `__init__` method. If a `selector` is given in the `__init__` method this attribute is ignored. This attribute is sometimes overridden in subclasses.

### `selector`

The `Selector` object to extract data from. It’s either the `selector` given in the `__init__` method or one created from the `response` given in the `__init__` method using the `default_selector_class`. This attribute is meant to be read-only.

### `add_css()`

`add_css(field_name, css, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.add_css)

Similar to `ItemLoader.add_value()` but receives a CSS `selector` instead of a value, which is used to extract a list of unicode strings from the `selector` associated with this `ItemLoader`.

See `get_css()` for `kwargs`.

**Parameters**:
- `css (str)` – the CSS selector to extract data from

Examples:

```python
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')
```

### `add_jmes()`

`add_jmes(field_name, jmes, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.add_jmes)

Similar to `ItemLoader.add_value()` but receives a JMESPath `selector` instead of a value, which is used to extract a list of unicode strings from the selector associated with this `ItemLoader`.

See `get_jmes()` for `kwargs`.

**Parameters**
- `jmes (str)` – the JMESPath selector to extract data from

Examples:

```python
# JSON snippet: {"name": "Color TV"}
loader.add_jmes('name', 'name')
# JSON snippet: {"price": "the price is $1200"}
loader.add_jmes('price', 'price', TakeFirst(), re='the price is (.*)')
```

### `add_value()`

`add_value(field_name, value, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.add_value)

Process and then add the given value for the given field.

The value is first passed through `get_value()` by giving the processors and kwargs, and then passed through the field input processor and its result appended to the data collected for that field. If the field already contains collected data, the new data is added.

The given `field_name` can be `None`, in which case values for multiple fields may be added. And the processed value should be a `dict` with `field_name` mapped to values.

Examples:

```python
loader.add_value('name', 'Color TV')
loader.add_value('colours', ['white', 'blue'])
loader.add_value('length', '100')
loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': 'foo', 'sex': 'male'})
```

### `add_xpath()`

`add_xpath(field_name, xpath, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.add_xpath)

Similar to `ItemLoader.add_value()` but receives an XPath instead of a value, which is used to extract a list of strings from the selector associated with this `ItemLoader`.

See `get_xpath()` for `kwargs`.

**Parameters**
- `xpath (str)` – the XPath to extract data from

Examples:

```python
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
```

### `get_collected_values()`

`get_collected_values(field_name)` 

[source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.get_collected_values)

Return the collected values for the given field.

### `get_css()`

`get_css(css, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.get_css)

Similar to `ItemLoader.get_value()` but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

**Parameters**
- `css (str)` – the CSS selector to extract data from
- `re (str or Pattern)` – a regular expression to use for extracting data from the selected CSS region

Examples:

```python
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
```

### `get_jmes()`

`get_jmes(jmes, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.get_jmes)

Similar to `ItemLoader.get_value()` but receives a JMESPath selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

**Parameters**
- `jmes (str)` – the JMESPath selector to extract data from
- `re (str or Pattern)` – a regular expression to use for extracting data from the selected JMESPath

Examples:

```python
# JSON snippet: {"name": "Color TV"}
loader.get_jmes('name')
# JSON snippet: {"price": "the price is $1200"}
loader.get_jmes('price', TakeFirst(), re='the price is (.*)')
```

### `get_output_value()`

`get_output_value(field_name)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.get_output_value)

Return the collected values parsed using the output processor, for the given `field`. This method doesn’t populate or modify the `item` at all.

### `get_value()`

`get_value(value, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.get_value)

Process the given value by the given processors and keyword arguments.

Available keyword arguments:

**Parameters**
- `re (str or Pattern)` – a regular expression to use for extracting data from the given value using `extract_regex()` method, applied before processors

Examples:

```ipython
>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst
>>> loader = ItemLoader()
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'
```

### `get_xpath()`

`get_xpath(xpath, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.get_xpath)

Similar to `ItemLoader.get_value()` but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

**Parameters**
- `xpath (str)` – the XPath to extract data from
- `re (str or Pattern)` – a regular expression to use for extracting data from the selected XPath region

Examples:

```python
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')
```

### `load_item()`

[source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.load_item)

Populate the item with the data collected so far, and return it. The data collected is first passed through the output processors to get the final value to assign to each item field.

### `nested_css()`

`nested_css(css, **context)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.nested_css)

Create a nested loader with a CSS selector. The supplied selector is applied relative to the selector associated with this `ItemLoader`. The nested loader shares the `item` with the parent `ItemLoader` so calls to `add_xpath()`, `add_value()`, `replace_value()`, etc. will behave as expected.

### `nested_xpath()`

`nested_xpath(xpath, **context)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.nested_xpath)

Create a nested loader with an XPath selector. The supplied selector is applied relative to the selector associated with this `ItemLoader`. The nested loader shares the item with the parent `ItemLoader` so calls to `add_xpath()`, `add_value()`, `replace_value()`, etc. will behave as expected.

### `replace_css()`

`replace_css(field_name, css, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.replace_css)

Similar to `add_css()` but replaces collected data instead of adding it.

### `replace_jmes()`

`replace_jmes(field_name, jmes, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.replace_jmes)

Similar to `add_jmes()` but replaces collected data instead of adding it.

### `replace_value()`

`replace_value(field_name, value, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.replace_value)

Similar to `add_value()` but replaces the collected data with the new value instead of adding it.

### `replace_xpath()`

`replace_xpath(field_name, xpath, *processors, re=None, **kw)` [source](https://docs.scrapy.org/en/latest/_modules/itemloaders.html#ItemLoader.replace_xpath)

Similar to `add_xpath()` but replaces collected data instead of adding it.

## Nested Loaders

When parsing related values from a subsection of a document, it can be useful to create nested loaders. Imagine you’re extracting details from a footer of a page that looks something like:

Example:

```html
<footer>
    <a class="social" href="https://facebook.com/whatever">Like Us</a>
    <a class="social" href="https://twitter.com/whatever">Follow Us</a>
    <a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
```

Without nested loaders, you need to specify the full `xpath` (or `css`) for each value that you wish to extract.

Example:

```python
loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath("social", '//footer/a[@class = "social"]/@href')
loader.add_xpath("email", '//footer/a[@class = "email"]/@href')
loader.load_item()
```

Instead, you can create a nested loader with the footer selector and add values relative to the footer. The functionality is the same but you avoid repeating the footer selector.

Example:

```python
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath("//footer")
footer_loader.add_xpath("social", 'a[@class = "social"]/@href')
footer_loader.add_xpath("email", 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
```

You can nest loaders arbitrarily and they work with either `xpath` or `css` selectors. As a general guideline, use nested loaders when they make your code simpler but do not go overboard with nesting or your parser can become difficult to read.

## Reusing and extending Item Loaders

As your project grows bigger and acquires more and more spiders, maintenance becomes a fundamental problem, especially when you have to deal with many different parsing rules for each spider, having a lot of exceptions, but also wanting to reuse the common processors.

Item Loaders are designed to ease the maintenance burden of parsing rules, without losing flexibility and, at the same time, providing a convenient mechanism for extending and overriding them. For this reason Item Loaders support traditional Python class inheritance for dealing with differences of specific spiders (or groups of spiders).

Suppose, for example, that some particular site encloses their product names in three dashes (e.g. `---Plasma TV---`) and you don’t want to end up scraping those dashes in the final product names.

Here’s how you can remove those dashes by reusing and extending the default Product Item Loader (`ProductLoader`):

```python
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader


def strip_dashes(x):
    return x.strip("-")


class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)
```
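
To see what the composed processor does with a value, the sketch below uses a simplified stand-in for `MapCompose`. `map_compose` is a hypothetical helper written for these notes; it only mimics the documented behaviour of applying each function to every value, flattening list results and dropping `None`.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose (not the real implementation)."""

    def process(value):
        values = value if isinstance(value, list) else [value]
        for function in functions:
            next_values = []
            for v in values:
                result = function(v)
                if isinstance(result, list):
                    next_values.extend(result)  # flatten output of a nested processor
                elif result is not None:
                    next_values.append(result)  # None results are dropped
            values = next_values
        return values

    return process


def strip_dashes(x):
    return x.strip("-")


product_name_in = map_compose(str.title)                   # plays the role of ProductLoader.name_in
site_name_in = map_compose(strip_dashes, product_name_in)  # plays the role of SiteSpecificLoader.name_in

print(site_name_in(["---plasma tv---"]))  # ['Plasma Tv']
```

The site-specific function runs first, and its result is then handed to the parent's processor, so the parent's rules are reused rather than copied.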

Another case where extending Item Loaders can be very helpful is when you have multiple source formats, for example `XML` and `HTML`. In the `XML` version you may want to remove CDATA occurrences. Here’s an example of how to do it:

```python
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata


class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)
```

And that’s how you typically extend input processors.

> As for output processors, it is more common to declare them in the field metadata, as they usually depend only on the field and not on each specific site parsing rule (as input processors do).

See also: [Declaring Input and Output Processors](#Declaring-Input-and-Output-Processors).

There are many other possible ways to extend, inherit and override your Item Loaders, and different Item Loaders hierarchies may fit better for different projects. Scrapy only provides the mechanism; it doesn’t impose any specific organization of your Loaders collection - that’s up to you and your project’s needs.

# 2.6 Scrapy shell

The `Scrapy shell` is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It’s meant to be used for testing data extraction code, but you can actually use it for testing any kind of code as it is also a regular Python shell.

The shell is used for testing `XPath` or `CSS` expressions and seeing how they work and what data they extract from the web pages you’re trying to scrape. It allows you to interactively test your expressions while you’re writing your spider, without having to run the spider to test every change.

Once you get familiarized with the Scrapy shell, you’ll see that it’s an invaluable tool for developing and debugging your spiders.

## Configuring the shell

If you have [IPython](https://ipython.org/) installed, the Scrapy shell will use it (instead of the standard Python console). The IPython console is much more powerful and provides smart auto-completion and colorized output, among other things.

We highly recommend you install IPython, especially if you’re working on Unix systems (where IPython excels).

Scrapy also has support for [bpython](https://bpython-interpreter.org/), and will try to use it where IPython is unavailable.

Through Scrapy’s settings you can configure it to use any one of ipython, bpython or the standard python shell, regardless of which are installed. This is done by setting the `SCRAPY_PYTHON_SHELL` environment variable; or by defining it in your `scrapy.cfg`:

```ini
[settings]
shell = bpython
```

## Launch the shell

To launch the `Scrapy shell` you can use the shell command like this:

```sh
scrapy shell <url>
```

Where the `<url>` is the URL you want to scrape.

`shell` also works for local files. This can be handy if you want to play around with a local copy of a web page. `shell` understands the following syntaxes for local files:

```sh
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html
```

> _Note:_ When using relative file paths, be explicit and prepend them with `./` (or `../` when relevant). `scrapy shell index.html` will not work as one might expect (and this is by design, not a bug).
>
> Because `shell` favors HTTP URLs over File URIs, and `index.html` is syntactically similar to `example.com`, `shell` will treat `index.html` as a domain name and trigger a DNS lookup error:

```sh
$ scrapy shell index.html
[ ... scrapy shell starts ... ]
[ ... traceback ... ]
twisted.internet.error.DNSLookupError: DNS lookup failed:
address 'index.html' not found: [Errno -5] No address associated with hostname.
```

`shell` will not test beforehand if a file called `index.html` exists in the current directory. Again, _**be explicit**_.
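
The rule described above can be approximated as follows. This is a simplified model based only on the behaviour described in this section, not Scrapy's actual code; `guess_target` is a hypothetical name, and the assumption that a bare argument is prefixed with `http://` is part of the sketch.

```python
from pathlib import Path


def guess_target(arg):
    """Simplified model of how the shell argument is interpreted (an assumption)."""
    if "://" in arg:
        return arg  # already has a scheme such as https:// or file://
    if arg.startswith(("./", "../", "/")):
        # Explicit path syntax: treat the argument as a local file.
        return Path(arg).resolve().as_uri()
    # Anything else is assumed to be a host name, with no check of the file system.
    return "http://" + arg


print(guess_target("index.html"))          # http://index.html
print(guess_target("https://scrapy.org"))  # https://scrapy.org
print(guess_target("./index.html").startswith("file://"))  # True
```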

## Using the `shell`

The Scrapy `shell` is just a regular Python console (or IPython console if you have it available) which provides some additional shortcut functions for convenience.

### Available Shortcuts

- `shelp()` - print a help with the list of available objects and shortcuts

- `fetch(url[, redirect=True])` - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP `3xx` redirections to not be followed by passing `redirect=False`

- `fetch(request)` - fetch a new response from the given `request` and update all related objects accordingly.

- `view(response)` - open the given response in your local web browser, for inspection. This will add a [`<base>` tag](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base) to the `response` body in order for external links (such as images and style sheets) to display properly. Note, however, that this will create a temporary file in your computer, which _**won’t be removed automatically**_.

### Available Scrapy objects

The Scrapy `shell` automatically creates some convenient objects from the downloaded page, like the `Response` object and the `Selector` objects (for both `HTML` and `XML` content).

Those objects are:

- `crawler` - the current `Crawler` object.

- `spider` - the `Spider` which is known to handle the URL, or a Spider object if there is no spider found for the current URL

- `request` - a `Request` object of the last fetched page. You can modify this request using `replace()` or fetch a new request (without leaving the shell) using the `fetch` shortcut.

- `response` - a `Response` object containing the last fetched page

- `settings` - the current [Scrapy settings](#2.11-Settings)

### Example of shell session

Here’s an example of a typical shell session where we start by scraping the [https://scrapy.org](https://scrapy.org) page, and then proceed to scrape the [https://old.reddit.com/](https://old.reddit.com/) page. Finally, we modify the (Reddit) `request` method to `POST` and re-fetch it getting an error. We end the session by typing `Ctrl-D` (in Unix systems) or `Ctrl-Z` in Windows.

Keep in mind that the data extracted here may not be the same when you try it, as those pages are not static and could have changed by the time you test this. The only purpose of this example is to get you familiarized with how the Scrapy `shell` works.

First, we launch the shell:

```sh
scrapy shell 'https://scrapy.org' --nolog
```

> _Note:_ Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the `&` character) will not work.

On Windows, use double quotes instead:

```sh
scrapy shell "https://scrapy.org" --nolog
```

Then, the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful shortcuts (you’ll notice that these lines all start with the [s] prefix):

```
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f6b866f6790>
[s]   item       {}
[s]   request    <GET https://scrapy.org>
[s]   response   <200 https://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x7f6b868cd5d0>
[s]   spider     <DefaultSpider 'default' at 0x7f6b868992d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:
```

After that, we can start playing with the objects:

```ipython
In [1]: response.xpath("//title/text()").get()
Out[1]: 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

In [2]: fetch('https://old.reddit.com/')

In [3]: response.xpath("//title/text()").get()
Out[3]: 'reddit: the front page of the internet'

In [4]: request = request.replace(method='POST')

In [5]: fetch(request)

In [6]: response.status
Out[6]: 404

In [7]: from pprint import pprint

In [8]: pprint(response.headers)
{b'Accept-Ranges': [b'bytes'],
 b'Cache-Control': [b'private, max-age=3600'],
 b'Content-Length': [b'8919'],
 b'Content-Type': [b'text/html; charset=UTF-8'],
 b'Date': [b'Sun, 11 Feb 2024 09:04:10 GMT'],
 b'Nel': [b'{"report_to": "w3-reporting-nel", "max_age": 14400, "include_sub'
          b'domains": false, "success_fraction": 1.0, "failure_fraction": 1.'
          b'0}'],
 b'Report-To': [b'{"group": "w3-reporting-nel", "max_age": 14400, "include_sub'
                b'domains": true,  "endpoints": [{ "url": "https://w3-reportin'
                b'g-nel.reddit.com/reports" }]}, {"group": "w3-reporting", "ma'
                b'x_age": 14400, "include_subdomains": true, "endpoints": [{ "'
                b'url": "https://w3-reporting.reddit.com/reports" }]}, {"group'
                b'": "w3-reporting-csp", "max_age": 14400, "include_subdomains'
                b'": true, "endpoints": [{ "url": "https://w3-reporting-csp.re'
                b'ddit.com/reports" }]}'],
 b'Server': [b'snooserv'],
 b'Set-Cookie': [b'session_tracker=kijjhfblqpcpaadmer.0.1707642250866.Z0FBQUFBQ'
                 b'mx5STJLd3pMTnZNMVRORllCWUZybnh2NFJ2S0RyOHhwOExLN0dsZi1xb1dnW'
                 b'kZiYVQzenRaRmpCVUVjbFF6SWNRX2NpVGNMdUxmcWtCNlVIUVpHUl8xdEhZT'
                 b'DBaMGQ3RW5QZjVVLWRCajBuUTlqcm52a0NkcXh4eWR2VERMVUhZdVdxdEY; '
                 b'Domain=reddit.com; Max-Age=7199; Path=/; expires=Sun, 11-Feb'
                 b'-2024 11:04:10 GMT; secure',
                 b'session_tracker=kijjhfblqpcpaadmer.0.1707642250873.Z0FBQUFBQ'
                 b'mx5STJLTzc0ZS04dmsyWTBYY1EzQmc1Y0lpYUpaQ0ViR2ExZVdfMXBGTENoY'
                 b'y1MUmJmSWRPTDZGUFJ3cmQ3N3dfUU1rVlNWWlUydHg2eDFfZ2xlUml2VFNqZ'
                 b'mprUUdCTmE1SlpHTm1zNm5nZ01SVGNQMjQzWm10a2VNekNRSVhRVEZBWlo; '
                 b'Domain=reddit.com; Max-Age=7199; Path=/; expires=Sun, 11-Feb'
                 b'-2024 11:04:10 GMT; secure'],
 b'Vary': [b'accept-encoding'],
 b'Via': [b'1.1 varnish'],
 b'X-Frame-Options': [b'SAMEORIGIN'],
 b'X-Reddit-Internal-Ratelimit-Rls-Type': [b'loid-standard'],
 b'X-Reddit-Pod-Ip': [b'10.102.228.135:80'],
 b'X-Ua-Compatible': [b'IE=edge']}
```

_ChatGPT:_  
- **Accept-Ranges**: Indicates that the server accepts byte ranges.
- **Cache-Control**: Specifies caching directives for the client, indicating that the response is private and can be cached for a maximum of 3600 seconds (1 hour).
- **Content-Length**: Indicates the length of the content body in bytes, which is 8919 bytes in this case.
- **Content-Type**: Specifies the type of content being returned, which is text/html with UTF-8 charset.
- **Date**: Provides the date and time when the response was generated by the server.
- **Nel**: This seems to be a JSON-encoded value providing reporting information.
- **Report-To**: Another JSON-encoded value providing reporting information with different groups and endpoints.
- **Server**: Specifies the server software being used, in this case, "snooserv."
- **Set-Cookie**: Sets one or more cookies on the client-side. There are multiple cookies being set, each with its own attributes like domain, expiration time, and secure flag.
- **Vary**: Specifies what request header(s) were used to determine the response.
- **Via**: Indicates the intermediate protocols and recipients between the client making the request and the server that ultimately responds to it.
- **X-Frame-Options, X-Reddit-Internal-Ratelimit-Rls-Type, X-Reddit-Pod-Ip, X-Ua-Compatible**: These are additional headers providing various types of information, such as frame options, rate-limiting information, pod IP, and browser compatibility.
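
In the output above, both header names and values are `bytes`, and each name maps to a list of values. The helper below is a small sketch for turning a mapping of that shape into plain strings; it works on an ordinary `dict` copied from the printed output, and `decode_headers` is a hypothetical name. Whether it can be applied unchanged to a live `response.headers` object is an assumption to verify in the shell.

```python
def decode_headers(raw):
    """Turn {bytes: [bytes, ...]} into {str: [str, ...]} for easier reading."""
    # latin-1 maps every byte to a character, so decoding cannot fail.
    return {
        name.decode("latin-1"): [value.decode("latin-1") for value in values]
        for name, values in raw.items()
    }


sample = {
    b"Content-Type": [b"text/html; charset=UTF-8"],
    b"Server": [b"snooserv"],
}
decoded = decode_headers(sample)
print(decoded["Content-Type"])  # ['text/html; charset=UTF-8']
```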

### Invoking the shell from spiders to inspect responses

Sometimes you want to inspect the responses that are being processed at a certain point of your spider, if only to check that the response you expect is getting there.

This can be achieved by using the `scrapy.shell.inspect_response` function.

Here’s an example of how you would call it from your spider:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response

            inspect_response(response, self)

        # Rest of parsing code.
```

When you run the spider, you will get something similar to this:

```ipython
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'
```

Then, you can check if the extraction code is working:

```ipython
>>> response.xpath('//h1[@class="fn"]')
[]
```

Nope, it doesn’t. So you can open the `response` in your web browser and see if it’s the response you were expecting:

```ipython
>>> view(response)
True
```

Finally you hit `Ctrl-D` (or `Ctrl-Z` in Windows) to exit the shell and resume the crawling:

```ipython
^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
```

Note that you can’t use the `fetch` shortcut here since the Scrapy engine is blocked by the `shell`. However, after you leave the `shell`, the spider will continue crawling where it stopped, as shown above.

# 2.7 Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

Each **item pipeline component** (sometimes referred to as just **“Item Pipeline”**) is a Python class that implements a simple method. Each component receives an item and performs an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database

## Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method:

- `process_item(self, item, spider)`
    - This method is called for every item pipeline component.
    - `item` is an item object, see [Supporting All Item Types](#Supporting-All-Item-Types).
    - `process_item()` must either: 
        - return an `item` object, 
        - return a `Deferred` or 
        - raise a `DropItem` exception.
    - Dropped items are no longer processed by further pipeline components.
    - **Parameters:**
        - `item (item object)` – the scraped item
        - `spider (Spider object)` – the spider which scraped the item

Additionally, they may also implement the following methods:

- `open_spider(self, spider)`
    - This method is called when the spider is opened.
    - **Parameters:**
        - `spider (Spider object)` – the spider which was opened

- `close_spider(self, spider)`
    - This method is called when the spider is closed.
    - **Parameters:**
        - `spider (Spider object)` – the spider which was closed

- `classmethod from_crawler(cls, crawler)`
    - If present, this class method is called to create a pipeline instance from a `Crawler`. It must return a new instance of the pipeline. The `Crawler` object provides access to all Scrapy core components, such as `settings` and `signals`; it is the way for a pipeline to access them and hook its functionality into Scrapy.
    - **Parameters**:
        - `crawler (Crawler object)` – crawler that uses this pipeline

## Item pipeline example

### Price validation and dropping items with no prices

Let’s take a look at the following hypothetical pipeline that adjusts the price attribute for those items that do not include VAT ("Value Added Tax", `price_excludes_vat` attribute), and drops those items which don’t contain a price:

```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class PricePipeline:
    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("price"):
            if adapter.get("price_excludes_vat"):
                adapter["price"] = adapter["price"] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")
```

### Write items to a `JSON lines` file

The following pipeline stores all scraped items (from all spiders) into a single `items.jsonl` file, containing one item per line serialized in JSON format:

```python
import json

from itemadapter import ItemAdapter


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("items.jsonl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
```

> _Note:_ The purpose of `JsonWriterPipeline` is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the [Feed exports](#2.8-Feed-exports).

### Write items to MongoDB

In this example we’ll write items to MongoDB using `pymongo`. The MongoDB address and database name are specified in Scrapy `settings`; the collection name is taken from the pipeline's `collection_name` class attribute.

The main point of this example is to show how to use the `from_crawler()` method and how to clean up the resources properly.

```python
import pymongo
from itemadapter import ItemAdapter


class MongoPipeline:
    collection_name = "scrapy_items"

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item
```

### Take screenshot of `item`

This example demonstrates how to use coroutine syntax in the `process_item()` method.

This item pipeline makes a request to a locally-running instance of Splash to render a screenshot of the item URL. After the request response is downloaded, the item pipeline saves the screenshot to a file and adds the filename to the item.

```python
import hashlib
from pathlib import Path
from urllib.parse import quote

import scrapy
from itemadapter import ItemAdapter
from scrapy.http.request import NO_CALLBACK
from scrapy.utils.defer import maybe_deferred_to_future


class ScreenshotPipeline:
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        encoded_item_url = quote(adapter["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url, callback=NO_CALLBACK)
        response = await maybe_deferred_to_future(
            spider.crawler.engine.download(request)
        )

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = adapter["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = f"{url_hash}.png"
        Path(filename).write_bytes(response.body)

        # Store filename in item.
        adapter["screenshot_filename"] = filename
        return item
```

### Duplicates filter

A filter that looks for duplicate items and drops those that were already processed. Let’s say that our items have a unique `id`, but our spider returns multiple items with the same `id`:

```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter["id"] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter["id"])
            return item
```

_ChatGPT:_  
In Python, `{item!r}` is f-string syntax that inserts the value of the variable `item` using its "representation": `!r` is a conversion flag that formats the value with `repr()` (that is, the object's `__repr__()` method) instead of `str()`.

The `!r` conversion flag keeps the output as close to the object's Python code representation as possible. This is useful for debugging: for many built-in types (strings, numbers, lists, dicts) the result can even be re-evaluated as valid Python code, although that is not guaranteed for arbitrary objects.

In the provided code snippet, `{item!r}` is used within the `DropItem` exception to include the representation of the item that caused the duplication. This allows for a more informative error message that helps in identifying which item is a duplicate.
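
As a quick illustration of the difference between the two conversions (a minimal sketch; the `Product` dataclass and its values are made up for this example and are not part of Scrapy):

```python
from dataclasses import dataclass


@dataclass
class Product:
    """Hypothetical item type, used only to show what repr() produces."""

    id: int
    name: str


name = "chair"
plain = f"{name}"  # formatted with str(): chair
quoted = f"{name!r}"  # formatted with repr(): 'chair'

item = Product(id=1, name=name)
message = f"Duplicate item found: {item!r}"

print(plain)
print(quoted)
print(message)
```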

## Activating an Item Pipeline component

To activate an Item Pipeline component you must add its class to the `ITEM_PIPELINES` setting, like in the following example:

```python
ITEM_PIPELINES = {
    "myproject.pipelines.PricePipeline": 300,
    "myproject.pipelines.JsonWriterPipeline": 800,
}
```

The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the `0-1000` range.
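
The ordering rule can be pictured with a small, Scrapy-free model. This is only a sketch of the idea, not how Scrapy is implemented: `run_pipelines`, `CheckedPipeline`, the simplified `PricePipeline` and the local `DropItem` stand-in are all hypothetical, and the dict maps classes directly instead of import paths so that the snippet runs on its own.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch needs no Scrapy."""


class PricePipeline:
    """Simplified validator: drops items that have no price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem(f"Missing price in {item!r}")
        return item


class CheckedPipeline:
    """Hypothetical later stage that marks items which survived validation."""

    def process_item(self, item, spider):
        item["checked"] = True
        return item


# Same shape as the ITEM_PIPELINES setting, declared out of order on purpose.
ITEM_PIPELINES = {CheckedPipeline: 800, PricePipeline: 300}


def run_pipelines(item, pipelines=ITEM_PIPELINES, spider=None):
    """Pass one item through the components, lowest value first."""
    for cls in sorted(pipelines, key=pipelines.get):
        try:
            item = cls().process_item(item, spider)
        except DropItem:
            return None  # dropped items skip all remaining components
    return item


print(run_pipelines({"name": "chair", "price": 10}))
print(run_pipelines({"name": "table"}))
```

The item with a price reaches `CheckedPipeline` (value 800) only after `PricePipeline` (value 300) has accepted it; the item without a price never gets there.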

# 2.8 Feed exports

# 2.9 Requests and Responses

Scrapy uses `Request` and `Response` objects for crawling web sites.

Typically, `Request` objects are generated in the spiders and pass across the system until they reach the `Downloader`, which executes the `request` and returns a `Response` object which travels back to the spider that issued the request.

Both `Request` and `Response` classes have subclasses which add functionality not required in the base classes. These are described below in `Request` subclasses and `Response` subclasses.

## `Request` objects

### `scrapy.http.Request`

```python
class scrapy.http.Request(*args: Any, **kwargs: Any)
```

[source](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request)  
Represents an HTTP request, which is usually generated in a Spider and executed by the Downloader, thus generating a Response.

**Parameters**:

- `url (str)` – the URL of this request. If the URL is invalid, a `ValueError` exception is raised.
- `callback (collections.abc.Callable)` – the function that will be called with the response of this request (once it’s downloaded) as its first parameter. In addition to a function, the following values are supported: 
    - `None` (default), which indicates that the spider’s `parse()` method must be used.
    - `NO_CALLBACK()`
    For more information, see Passing additional data to callback functions.

> _Note:_ If exceptions are raised during processing, errback is called instead.

- `method (str)` – the HTTP method of this request. Defaults to `'GET'`.
- `meta (dict)` – the initial values for the `Request.meta` attribute. If given, the `dict` passed in this parameter will be shallow copied.
- `body (bytes or str)` – the request body. If a string is passed, then it’s encoded as bytes using the encoding passed (which defaults to `utf-8`). If body is not given, an empty bytes object is stored. Regardless of the type of this argument, the final value stored will be a bytes object (never a string or `None`).
- `headers (dict)` – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If `None` is passed as value, the HTTP header will not be sent at all. 

> **Caution**: Cookies set via the `Cookie` header are not considered by the `CookiesMiddleware`. If you need to set cookies for a request, use the `Request.cookies` parameter. This is a known current limitation that is being worked on.

- `cookies (dict or list)` – the request cookies. These can be sent in two forms.
    1. Using a `dict`:
```python
request_with_cookies = Request(
    url="http://www.example.com",
    cookies={"currency": "USD", "country": "UY"},
)
```
    2. Using a list of dicts:
```python
request_with_cookies = Request(
    url="http://www.example.com",
    cookies=[
        {
            "name": "currency",
            "value": "USD",
            "domain": "example.com",
            "path": "/currency",
        },
    ],
)
```

The latter form allows for customizing the `domain` and `path` attributes of the cookie. This is only useful if the cookies are saved for later requests. When a site returns cookies (in a response), those are stored in the cookies for that domain and will be sent again in future requests. That’s the typical behaviour of any regular web browser. Note that setting the `dont_merge_cookies` key to `True` in `request.meta` causes custom cookies to be ignored. For more info see [CookiesMiddleware](#CookiesMiddleware).

_New in version 2.6.0: Cookie values that are `bool`, `float` or `int` are cast to `str`._

- `encoding (str)` – the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to bytes (if given as a string).
- `priority (int)` – the priority of this request (defaults to `0`). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
- `dont_filter (bool)` – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to `False`.
- `errback (collections.abc.Callable)` – a function that will be called if any exception was raised while processing the request. This includes pages that failed with `404` HTTP errors and such. It receives a `Failure` as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.

_Changed in version 2.0: The `callback` parameter is no longer required when the `errback` parameter is specified._

- `flags (list)` – Flags sent to the request, can be used for logging or similar purposes.
- `cb_kwargs (dict)` – A `dict` with arbitrary data that will be passed as keyword arguments to the Request’s callback.

#### `url`

A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the `__init__` method.

This attribute is read-only. To change the URL of a `Request` use `replace()`.

#### `method`

A string representing the HTTP method in the request. This is guaranteed to be uppercase. Examples: `"GET"`, `"POST"`, `"PUT"`, etc.

#### `headers`

A dictionary-like object which contains the request headers.

#### `body`

The request body as bytes.

This attribute is read-only. To change the body of a `Request` use `replace()`.

#### `meta= {}`

A dictionary of arbitrary metadata for the request.

You may extend request metadata as you see fit.

Request metadata can also be accessed through the `meta` attribute of a `response`.

To pass data from one spider callback to another, consider using `cb_kwargs` instead. However, request metadata may be the right choice in certain scenarios, such as to maintain some debugging data across all follow-up requests (e.g. the source URL).

A common use of request metadata is to define request-specific parameters for Scrapy components (extensions, middlewares, etc.). For example, if you set `dont_retry` to `True`, `RetryMiddleware` will never retry that request, even if it fails. See `Request.meta` special keys.

You may also use request metadata in your custom Scrapy components, for example, to keep request state information relevant to your component. For example, `RetryMiddleware` uses the `retry_times` metadata key to keep track of how many times a request has been retried so far.

Copying all the metadata of a previous request into a new, follow-up request in a spider callback is a bad practice, because request metadata may include metadata set by Scrapy components that is not meant to be copied into other requests. For example, copying the `retry_times` metadata key into follow-up requests can lower the amount of retries allowed for those follow-up requests.

You should only copy all request metadata from one request to another if the new request is meant to replace the old request, as is often the case when returning a request from a downloader middleware method.

Also mind that the `copy()` and `replace()` request methods [shallow-copy](https://docs.python.org/3/library/copy.html) request metadata.

#### `cb_kwargs`

A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the `Request`’s callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a `Response` object as argument.

This `dict` is shallow copied when the request is cloned using the `copy()` or `replace()` methods, and can also be accessed, in your spider, from the `response.cb_kwargs` attribute.

In case of a failure to process the request, this dict can be accessed as `failure.request.cb_kwargs` in the request’s `errback`. For more information, see **Accessing additional data in errback functions**.

#### `attributes`

```python
attributes: Tuple[str, ...]= ('url', 'callback', 'method', 'headers', 'body', 'cookies', 'meta', 'encoding', 'priority', 'dont_filter', 'errback', 'flags', 'cb_kwargs')
```

A tuple of `str` objects containing the name of all public attributes of the class that are also keyword parameters of the `__init__` method.

Currently used by `Request.replace()`, `Request.to_dict()` and `request_from_dict()`.

#### `copy()`

[source](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request.copy)

Return a new `Request` which is a copy of this `Request`. See also: **Passing additional data to callback functions**.

#### `replace()`

```python
replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])
```
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request.replace)

Return a `Request` object with the same members, except for those members given new values by whichever keyword arguments are specified. The `Request.cb_kwargs` and `Request.meta` attributes are shallow copied by default (unless new values are given as arguments). See also [Passing additional data to callback functions](#Passing-additional-data-to-callback-functions).

#### `from_curl()`

```python
classmethod from_curl(curl_command: str, ignore_unknown_options: bool = True, **kwargs)→ RequestTypeVar
```
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request.from_curl)  
Create a `Request` object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies and the body. It accepts the same arguments as the `Request` class, taking preference and overriding the values of the same arguments contained in the cURL command.

Unrecognized options are ignored by default. To raise an error when finding unknown options call this method by passing `ignore_unknown_options=False`.

> **Caution**: Using `from_curl()` from `Request` subclasses, such as `JsonRequest` or `XmlRpcRequest`, as well as having downloader middlewares and spider middlewares enabled, such as `DefaultHeadersMiddleware`, `UserAgentMiddleware`, or `HttpCompressionMiddleware`, may modify the `Request` object.

To translate a cURL command into a Scrapy request, you may use `curl2scrapy`.

#### `to_dict()`

```python
to_dict(*, spider: Optional[Spider] = None)→ dict
```
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request.to_dict)  
Return a dictionary containing the `Request`’s data.

Use `request_from_dict()` to convert back into a `Request` object.

If a spider is given, this method will try to find out the name of the spider methods used as callback and errback and include them in the output dict, raising an exception if they cannot be found.

### Other functions related to requests

```python
scrapy.http.request.NO_CALLBACK(*args, **kwargs)
```
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#NO_CALLBACK)  
When assigned to the callback parameter of `Request`, it indicates that the request is not meant to have a spider callback at all.

For example:
```python
Request("https://example.com", callback=NO_CALLBACK)
```

This value should be used by components that create and handle their own requests, e.g. through `scrapy.core.engine.ExecutionEngine.download()`, so that downloader middlewares handling such requests can treat them differently from requests intended for the `parse()` callback.

```python
scrapy.utils.request.request_from_dict(d: dict, *, spider: Optional[Spider] = None)→ Request
```
[source](https://docs.scrapy.org/en/latest/_modules/scrapy/utils/request.html#request_from_dict)  
Create a `Request` object from a `dict`.

If a spider is given, it will try to resolve the callbacks looking at the spider for methods with the same name.

### Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded `Response` object as its first argument.

Example:

```python
def parse_page1(self, response):
    return scrapy.Request(
        "http://www.example.com/some_page.html", callback=self.parse_page2
    )


def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
```

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. The following example shows how to achieve this by using the `Request.cb_kwargs` attribute:

```python
def parse(self, response):
    request = scrapy.Request(
        "http://www.example.com/index.html",
        callback=self.parse_page2,
        cb_kwargs=dict(main_url=response.url),
    )
    request.cb_kwargs["foo"] = "bar"  # add more arguments for the callback
    yield request


def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
```

> **Caution**: `Request.cb_kwargs` was introduced in version 1.7. Prior to that, using `Request.meta` was recommended for passing information around callbacks. After 1.7, `Request.cb_kwargs` became the preferred way for handling user information, leaving `Request.meta` for communication with components like middlewares and extensions.

### Using `errbacks` to catch exceptions in `request` processing

The `errback` of a `request` is a function that will be called when an exception is raised while processing it.

It receives a `Failure` as first parameter and can be used to track connection establishment timeouts, DNS errors, etc.

Here’s an example spider logging all errors and catching some specific errors if needed:

```python
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",  # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",  # non-responding host, timeout expected
        "https://example.invalid/",  # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(
                u,
                callback=self.parse_httpbin,
                errback=self.errback_httpbin,
                dont_filter=True,
            )

    def parse_httpbin(self, response):
        self.logger.info("Got successful response from {}".format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error("HttpError on %s", response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error("DNSLookupError on %s", request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError on %s", request.url)
```

## `<Not finished yet>`

# 2.10 Link Extractors

A **link extractor** is an object that extracts links from `response`s.

The `__init__` method of `LxmlLinkExtractor` takes `settings` that determine which links may be extracted. `LxmlLinkExtractor.extract_links` returns a _list_ of matching **Link objects** from a `Response` object.

Link extractors are used in `CrawlSpider` spiders through a set of `Rule` objects.

You can also use link extractors in regular spiders. For example, you can instantiate `LinkExtractor` into a class variable in your spider, and use it from your spider callbacks:

```python
def parse(self, response):
    for link in self.link_extractor.extract_links(response):
        yield Request(link.url, callback=self.parse)
```

## Link extractor reference

The link extractor class is `scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor`. For convenience it can also be imported as `scrapy.linkextractors.LinkExtractor`:

```python
from scrapy.linkextractors import LinkExtractor
```

### `LxmlLinkExtractor`

```python
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)
``` 

[source](https://docs.scrapy.org/en/latest/_modules/scrapy/linkextractors/lxmlhtml.html#LxmlLinkExtractor)

`LxmlLinkExtractor` is the recommended link extractor with handy filtering options. It is implemented using `lxml`’s robust `HTMLParser`.

**Parameters**
- `allow (str or list)` – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
- `deny (str or list)` – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted). It has precedence over the `allow` parameter. If not given (or empty) it won’t exclude any links.
- `allow_domains (str or list)` – a single value or a list of strings containing domains which will be considered for extracting the links
- `deny_domains (str or list)` – a single value or a list of strings containing domains which won’t be considered for extracting the links
- `deny_extensions (list)` – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to `scrapy.linkextractors.IGNORED_EXTENSIONS`.

_Changed in version 2.0:_ `IGNORED_EXTENSIONS` now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.

- `restrict_xpaths (str or list)` – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links.
- `restrict_css (str or list)` – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as `restrict_xpaths`.
- `restrict_text (str or list)` – a single regular expression (or list of regular expressions) that the link’s `text` must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
- `tags (str or list)` – a tag or a list of tags to consider when extracting links. Defaults to `('a', 'area')`.
- `attrs (list)` – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the `tags` parameter). Defaults to `('href',)`.
- `canonicalize (bool)` – canonicalize each extracted url (using `w3lib.url.canonicalize_url`). Defaults to `False`. Note that `canonicalize_url` is meant for duplicate checking; it can change the URL visible at server side, so the `response` can be different for requests with canonicalized and raw URLs. If you’re using `LinkExtractor` to follow links it is more robust to keep the default `canonicalize=False`.
- `unique (bool)` – whether duplicate filtering should be applied to extracted links.
- `process_value (collections.abc.Callable)` – a function which receives each value extracted from the `tag` and `attributes` scanned and can modify the value and return a new one, or return `None` to ignore the link altogether. If not given, `process_value` defaults to `lambda x: x`.

For example, to extract links from this code:

```html
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
```

You can use the following function in `process_value`:

```python
import re


def process_value(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
    # no match: None is returned implicitly, so the link is ignored

- `strip (bool)` – whether to strip whitespaces from extracted attributes. According to `HTML5` standard, leading and trailing whitespaces must be stripped from `href` attributes of `<a>`, `<area>` and many other elements, `src` attribute of `<img>`, `<iframe>` elements, etc., so `LinkExtractor` strips space chars by default. Set `strip=False` to turn it off (e.g. if you’re extracting urls from elements or attributes which allow leading/trailing whitespaces).

```python
extract_links(response)
```

[source](https://docs.scrapy.org/en/latest/_modules/scrapy/linkextractors/lxmlhtml.html#LxmlLinkExtractor.extract_links)
- Returns a list of `Link` objects from the specified `response`.
- Only links that match the `settings` passed to the `__init__` method of the link extractor are returned.
- Duplicate links are omitted if the `unique` attribute is set to `True`, otherwise they are returned.

### Link

```python
class scrapy.link.Link(url: str, text: str = '', fragment: str = '', nofollow: bool = False)
```

[source](https://docs.scrapy.org/en/latest/_modules/scrapy/link.html#Link)

`Link` objects represent a link extracted by the `LinkExtractor`.

Using the `anchor` tag sample below to illustrate the parameters:

```html
<a href="https://example.com/nofollow.html#foo" rel="nofollow">Dont follow this one</a>
```

**Parameters**
- `url` – the absolute url being linked to in the `anchor` tag. From the sample, this is `https://example.com/nofollow.html`.
- `text` – the text in the `anchor` tag. From the sample, this is `Dont follow this one`.
- `fragment` – the part of the url after the `hash` symbol. From the sample, this is `foo`.
- `nofollow` – an indication of the presence or absence of a nofollow value in the `rel` attribute of the `anchor` tag.

# 2.11 Settings

## Designating the settings

## Built-in settings reference

# <b>3. Built-in Services</b>

# 3.1 Logging

## Logging from Spiders

# <b>4. Solving specific problems</b>

# 4.4 Common practices

## Run Scrapy from a script

# 4.7 Selecting dynamically-loaded content

## `scrapy-playwright` -VR

### Installation

```sh
(venv) pip install scrapy-playwright
(venv) playwright install
```
```python
# settings.py

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
```

### `PageMethod`

In `scrapy-playwright`, a `PageMethod` object describes a single method call to be made (and awaited, if necessary) on the Playwright `Page` object before the response is returned to the spider. It takes the name of the `Page` method followed by that method's positional and keyword arguments, for example `PageMethod("click", "button#more")`. `PageMethod` objects are passed as a list in the `playwright_page_methods` key of `Request.meta` (together with `"playwright": True`), and they run in the order given.

The Playwright `Page` class is what actually drives the browser: it lets you navigate to URLs, click on elements, fill out forms, and extract content from the rendered page. `PageMethod` is the bridge that lets a Scrapy request schedule those `Page` calls from your spider.

Here are some common `Page` methods (names as in Playwright's Python API) that can be used this way:

- `goto(url)`: navigates to the specified URL.
- `content()`: retrieves the full HTML content of the page.
- `title()`: retrieves the title of the page.
- `click(selector)`: clicks on the element matching the specified selector.
- `wait_for_load_state()`: waits until the page has reached the requested load state.
- `text_content(selector)`: retrieves the text content of the element matching the specified selector.
- `get_attribute(selector, name)`: retrieves the value of the specified attribute of the element matching the selector.
- `set_content(html)`: sets the HTML content of the page.

These methods let you interact with the rendered page before Scrapy receives the response, so that content loaded dynamically is present when your callback runs.

# 4.8 Debugging memory leaks

## Debugging memory leaks with `trackref`

# 4.12 Benchmarking

# 4.13 Jobs: pausing and resuming crawls

Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.

Scrapy supports this functionality out of the box by providing the following facilities:
- a scheduler that persists scheduled requests on disk
- a duplicates filter that persists visited requests on disk
- an extension that keeps some spider state (key/value pairs) persistent between batches

## Job directory

To enable persistence support you just need to define a job directory through the `JOBDIR` setting. This directory will be for storing all required data to keep the state of a single job (i.e. a spider run). It’s important to note that 

> this directory must not be shared by different spiders, or even different jobs/runs of the same spider, as it’s meant to be used for storing the state of a _**single**_ job.

## How to use it

To start a spider with persistence support enabled, run it like this:

```sh
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```
Then, you can stop the spider safely at any time (by pressing `Ctrl-C` or sending a signal), and resume it later by issuing the same command:

```sh
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

VR:  
Alternatively, set `JOBDIR = "./myjob"` in `settings.py` so that it does not have to be passed on the command line. But remember:

_ChatGPT:_  
When `JOBDIR` is defined in `settings.py`, Scrapy reads it every time a crawl starts, so the same directory applies to every spider in the project and to every run. That conflicts with the rule quoted above unless the project has a single spider and every run is meant to resume the previous one.

In the context of Scrapy, a "job directory" stores everything needed to keep the state of a single job, i.e. one logical run of one spider: the queue of pending requests, the record of requests already seen, and the serialized `spider.state`. It does not hold the scraped items or the logs.

The statement "_or even different jobs/runs of the same spider_" means that a fresh crawl of the same spider counts as a new job and must not reuse the directory of an earlier one.

Here's an explanation of the situation:

- **Different jobs/runs of the same spider:**
    - A job is one logical crawl. Stopping it and starting it again with the same `JOBDIR` resumes that same job.
    - Running the spider again as a fresh crawl, for example tomorrow or with different arguments, is a separate job and needs its own directory.
- **Why job directories should not be shared:**
    - If a fresh crawl reuses an old directory, it loads the old request queue and the old duplicates filter, so it continues the old job instead of starting over and skips every request already recorded as seen.
    - If two spiders, or two runs at the same time, share a directory, they read and write the same queue and state files, which can corrupt them or produce incomplete results.

To avoid these issues, give each job its own directory (for example `crawls/somespider-1`, then `crawls/somespider-2`) and reuse a directory only to resume that same job.
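One way to guarantee a separate directory per job is to generate the path instead of typing it. The sketch below is only an illustration: `next_jobdir` and `crawl_command` are hypothetical helpers (not part of Scrapy), and the `crawls/<spider>-<n>` naming simply mirrors the commands shown above.

```python
from pathlib import Path


def next_jobdir(spider_name: str, base: str = "crawls") -> str:
    """Return the first <base>/<spider_name>-<n> path that does not exist yet."""
    n = 1
    while (Path(base) / f"{spider_name}-{n}").exists():
        n += 1
    return (Path(base) / f"{spider_name}-{n}").as_posix()


def crawl_command(spider_name: str, jobdir: str) -> list:
    """Build the argument list for `scrapy crawl <spider> -s JOBDIR=<jobdir>`."""
    return ["scrapy", "crawl", spider_name, "-s", f"JOBDIR={jobdir}"]


# A fresh job gets a new directory; to resume a job, pass its existing directory.
fresh = crawl_command("somespider", next_jobdir("somespider"))
resume = crawl_command("somespider", "crawls/somespider-1")
print(" ".join(resume))  # scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

The argument list can be handed to `subprocess.run` or printed and pasted into a shell.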

## Keeping persistent state between batches

Sometimes you’ll want to keep some persistent spider state between pause/resume batches. You can use the `spider.state` attribute for that, which should be a `dict`. There’s a built-in extension that takes care of serializing, storing and loading that attribute from the job directory, when the spider starts and stops.

Here’s an example of a callback that uses the spider state (other spider code is omitted for brevity):

```python
def parse_item(self, response):
    # parse item here
    self.state["items_count"] = self.state.get("items_count", 0) + 1
```

This callback belongs in the spider class, i.e. in the spider module (e.g. `spider.py`).

## Pass `self.state[key]` values from `parse()` to pipelines

Just use `spider.state[key]` within the pipeline: every pipeline method that receives the `spider` argument can read the spider's `state` attribute.
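A minimal sketch of such a pipeline follows. `ProgressPipeline`, its `every` parameter and the `items_count` key are assumptions made for illustration; the only thing taken from Scrapy is the `process_item(self, item, spider)` signature.

```python
class ProgressPipeline:
    """Hypothetical pipeline that reports progress using a counter kept in spider.state."""

    def __init__(self, every=100):
        self.every = every
        self.reports = []  # messages are kept in memory here; a real pipeline might log them

    def process_item(self, item, spider):
        # `spider.state` is the same dict that the spider's callbacks update.
        count = spider.state.get("items_count", 0)
        if count and count % self.every == 0:
            self.reports.append(f"{spider.name}: {count} items so far")
        return item
```

Because the pipeline only reads the dict, the spider callback stays the single place where the counter is written.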

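Be careful, though: values assigned to `self.state` in the spider's `__init__` method do not survive. They are still visible in the pipeline's `open_spider()` method, but after that they are gone (see the call order below). This is consistent with the built-in state extension assigning `spider.state` (the dict loaded from the job directory, or an empty one) when the spider is opened, which replaces whatever `__init__` stored. Initialize state values in `parse` or a later callback instead; those are saved. The drawback is that counter initialization ends up scattered across the callbacks.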

```
DEBUG: spider: SomeSpider init -> self.state = dict()
DEBUG: pipeline: from_crawler class method
DEBUG: pipeline: SomePipeline init
DEBUG: pipeline: open_spider -> spider.state is accessible
DEBUG: spider: start_requests -> self.state is not accessible anymore
DEBUG: spider: parse
DEBUG: spider: parse_item
DEBUG: pipeline: process_item
DEBUG: pipeline: close_spider
DEBUG: spider: closed
```

> Note: A `self.state` dict created in the spider's `__init__` method is not the same object as the `self.state` used in `parse` and later methods. As of March 1st, 2024, attributes set in `__init__` are created anew on every run, whereas `state` values set in `parse` and later methods are saved as the persistent `state`. So, be cautious.
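The call order above can be reproduced with a small stand-in for the state extension. The sketch below is not Scrapy's code: `FakeSpider`, `open_spider_state`, `close_spider_state` and `run_once` are hypothetical names, and it assumes that the extension assigns `spider.state` when the spider is opened (a pickled dict from the job directory if one exists, otherwise an empty dict) and pickles it again when the spider is closed.

```python
import pickle
import tempfile
from pathlib import Path


class FakeSpider:
    """Stand-in for a spider; deliberately not a scrapy.Spider."""

    def __init__(self):
        # Set too early: this dict is replaced when the spider is opened.
        self.state = {"set_in_init": True}

    def parse_item(self):
        # Set late enough: this dict is the one that gets saved.
        self.state["items_count"] = self.state.get("items_count", 0) + 1


def open_spider_state(spider, jobdir):
    """Assign spider.state from <jobdir>/spider.state, or an empty dict."""
    path = Path(jobdir) / "spider.state"
    spider.state = pickle.loads(path.read_bytes()) if path.exists() else {}


def close_spider_state(spider, jobdir):
    """Persist spider.state to <jobdir>/spider.state."""
    (Path(jobdir) / "spider.state").write_bytes(pickle.dumps(spider.state))


def run_once(jobdir, items):
    """Simulate one pause/resume batch that scrapes `items` items."""
    spider = FakeSpider()
    open_spider_state(spider, jobdir)
    for _ in range(items):
        spider.parse_item()
    close_spider_state(spider, jobdir)
    return spider.state


jobdir = tempfile.mkdtemp()
first = run_once(jobdir, items=3)
second = run_once(jobdir, items=2)
print(first)   # {'items_count': 3}
print(second)  # {'items_count': 5}
```

The `set_in_init` key never reaches the saved state, while the counter written in the callback carries over to the second batch.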

## Persistence gotchas

There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:

### Cookies expiration

Cookies may expire. So, if you don’t resume your spider quickly the requests scheduled may no longer work. This won’t be an issue if your spider doesn’t rely on cookies.

### Request serialization

For persistence to work, `Request` objects must be serializable with [`pickle`](https://docs.python.org/3/library/pickle.html#module-pickle), except for the `callback` and `errback` values passed to their `__init__` method, which must be methods of the running `Spider` class.

If you wish to log the requests that couldn’t be serialized, you can set the `SCHEDULER_DEBUG` setting to `True` in the project’s settings page. It is `False` by default.
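A quick way to see what "serializable with `pickle`" rules out is to try pickling the values you plan to attach to a request, for example in `meta` or `cb_kwargs`. `is_picklable` is a hypothetical helper, not part of Scrapy, and the two dictionaries are made-up examples.

```python
import pickle


def is_picklable(value) -> bool:
    """Return True if `value` can be serialized with pickle."""
    try:
        pickle.dumps(value)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


plain_meta = {"page": 3, "category": "books"}
bad_meta = {"transform": lambda text: text.strip()}  # lambdas cannot be pickled

print(is_picklable(plain_meta))  # True
print(is_picklable(bad_meta))    # False
```

Keeping plain data (strings, numbers, lists, dicts) in requests and referring to behaviour by spider method name avoids this class of failure.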

# <b>5. Extending Scrapy</b>

# 5.2 Add-ons

## Built-in downloader middleware reference

### `CookiesMiddleware`

# 5.4 Spider Middleware

# 5.8 Item Exporters

## Serialization of item fields

# 5.10 Core API

## Crawler API

# <b>6. All the rest</b>

# <b>7. Source code</b>

## `scrapy.spiders.crawl`

```python
"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy
from typing import AsyncIterable, Awaitable, Sequence

from scrapy.http import HtmlResponse, Request, Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Spider
from scrapy.utils.asyncgen import collect_asyncgen
from scrapy.utils.spider import iterate_spider_output


def _identity(x):
    return x


def _identity_process_request(request, response):
    return request


def _get_method(method, spider):
    if callable(method):
        return method
    if isinstance(method, str):
        return getattr(spider, method, None)


_default_link_extractor = LinkExtractor()


class Rule:
    def __init__(
        self,
        link_extractor=None,
        callback=None,
        cb_kwargs=None,
        follow=None,
        process_links=None,
        process_request=None,
        errback=None,
    ):
        self.link_extractor = link_extractor or _default_link_extractor
        self.callback = callback
        self.errback = errback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links or _identity
        self.process_request = process_request or _identity_process_request
        self.follow = follow if follow is not None else not callback

    def _compile(self, spider):
        self.callback = _get_method(self.callback, spider)
        self.errback = _get_method(self.errback, spider)
        self.process_links = _get_method(self.process_links, spider)
        self.process_request = _get_method(self.process_request, spider)



class CrawlSpider(Spider):
    rules: Sequence[Rule] = ()

    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        self._compile_rules()

    def _parse(self, response, **kwargs):
        return self._parse_response(
            response=response,
            callback=self.parse_start_url,
            cb_kwargs=kwargs,
            follow=True,
        )

    def parse_start_url(self, response, **kwargs):
        return []


    def process_results(self, response: Response, results: list):
        return results

    def _build_request(self, rule_index, link):
        return Request(
            url=link.url,
            callback=self._callback,
            errback=self._errback,
            meta=dict(rule=rule_index, link_text=link.text),
        )

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [
                lnk
                for lnk in rule.link_extractor.extract_links(response)
                if lnk not in seen
            ]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def _callback(self, response, **cb_kwargs):
        rule = self._rules[response.meta["rule"]]
        return self._parse_response(
            response, rule.callback, {**rule.cb_kwargs, **cb_kwargs}, rule.follow
        )

    def _errback(self, failure):
        rule = self._rules[failure.request.meta["rule"]]
        return self._handle_failure(failure, rule.errback)

    async def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            if isinstance(cb_res, AsyncIterable):
                cb_res = await collect_asyncgen(cb_res)
            elif isinstance(cb_res, Awaitable):
                cb_res = await cb_res
            cb_res = self.process_results(response, cb_res)
            for request_or_item in iterate_spider_output(cb_res):
                yield request_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _handle_failure(self, failure, errback):
        if errback:
            results = errback(failure) or ()
            for request_or_item in iterate_spider_output(results):
                yield request_or_item

    def _compile_rules(self):
        self._rules = []
        for rule in self.rules:
            self._rules.append(copy.copy(rule))
            self._rules[-1]._compile(self)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            "CRAWLSPIDER_FOLLOW_LINKS", True
        )
        return spider
```

## `scrapy.item`

```python
"""
Scrapy Item

See documentation in docs/topics/item.rst
"""

from abc import ABCMeta
from collections.abc import MutableMapping
from copy import deepcopy
from pprint import pformat
from typing import Dict

from scrapy.utils.trackref import object_ref


class Field(dict):
    """Container of field metadata"""


class ItemMeta(ABCMeta):
    """Metaclass_ of :class:`Item` that handles field definitions.

    .. _metaclass: https://realpython.com/python-metaclasses
    """

    def __new__(mcs, class_name, bases, attrs):
        classcell = attrs.pop("__classcell__", None)
        new_bases = tuple(base._class for base in bases if hasattr(base, "_class"))
        _class = super().__new__(mcs, "x_" + class_name, new_bases, attrs)

        fields = getattr(_class, "fields", {})
        new_attrs = {}
        for n in dir(_class):
            v = getattr(_class, n)
            if isinstance(v, Field):
                fields[n] = v
            elif n in attrs:
                new_attrs[n] = attrs[n]

        new_attrs["fields"] = fields
        new_attrs["_class"] = _class
        if classcell is not None:
            new_attrs["__classcell__"] = classcell
        return super().__new__(mcs, class_name, bases, new_attrs)



class Item(MutableMapping, object_ref, metaclass=ItemMeta):
    """
    Base class for scraped items.

    In Scrapy, an object is considered an ``item`` if it is an instance of either
    :class:`Item` or :class:`dict`, or any subclass. For example, when the output of a
    spider callback is evaluated, only instances of :class:`Item` or
    :class:`dict` are passed to :ref:`item pipelines <topics-item-pipeline>`.

    If you need instances of a custom class to be considered items by Scrapy,
    you must inherit from either :class:`Item` or :class:`dict`.

    Items must declare :class:`Field` attributes, which are processed and stored
    in the ``fields`` attribute. This restricts the set of allowed field names
    and prevents typos, raising ``KeyError`` when referring to undefined fields.
    Additionally, fields can be used to define metadata and control the way
    data is processed internally. Please refer to the :ref:`documentation
    about fields <topics-items-fields>` for additional information.

    Unlike instances of :class:`dict`, instances of :class:`Item` may be
    :ref:`tracked <topics-leaks-trackrefs>` to debug memory leaks.
    """

    fields: Dict[str, Field]

    def __init__(self, *args, **kwargs):
        self._values = {}
        if args or kwargs:  # avoid creating dict for most common case
            for k, v in dict(*args, **kwargs).items():
                self[k] = v

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        if key in self.fields:
            self._values[key] = value
        else:
            raise KeyError(f"{self.__class__.__name__} does not support field: {key}")

    def __delitem__(self, key):
        del self._values[key]

    def __getattr__(self, name):
        if name in self.fields:
            raise AttributeError(f"Use item[{name!r}] to get field value")
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if not name.startswith("_"):
            raise AttributeError(f"Use item[{name!r}] = {value!r} to set field value")
        super().__setattr__(name, value)

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        return iter(self._values)

    __hash__ = object_ref.__hash__

    def keys(self):
        return self._values.keys()

    def __repr__(self):
        return pformat(dict(self))

    def copy(self):
        return self.__class__(self)

    def deepcopy(self):
        """Return a :func:`~copy.deepcopy` of this item."""
        return deepcopy(self)
```

## `scrapy.loader`

```python
"""
Item Loader

See documentation in docs/topics/loaders.rst
"""
import itemloaders

from scrapy.item import Item
from scrapy.selector import Selector


class ItemLoader(itemloaders.ItemLoader):
    """
    A user-friendly abstraction to populate an :ref:`item <topics-items>` with data
    by applying :ref:`field processors <topics-loaders-processors>` to scraped data.
    When instantiated with a ``selector`` or a ``response`` it supports
    data extraction from web pages using :ref:`selectors <topics-selectors>`.

    :param item: The item instance to populate using subsequent calls to
        :meth:`~ItemLoader.add_xpath`, :meth:`~ItemLoader.add_css`,
        or :meth:`~ItemLoader.add_value`.
    :type item: scrapy.item.Item

    :param selector: The selector to extract data from, when using the
        :meth:`add_xpath`, :meth:`add_css`, :meth:`replace_xpath`, or
        :meth:`replace_css` method.
    :type selector: :class:`~scrapy.selector.Selector` object

    :param response: The response used to construct the selector using the
        :attr:`default_selector_class`, unless the selector argument is given,
        in which case this argument is ignored.
    :type response: :class:`~scrapy.http.Response` object

    If no item is given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

    The item, selector, response and remaining keyword arguments are
    assigned to the Loader context (accessible through the :attr:`context` attribute).

    .. attribute:: item

        The item object being parsed by this Item Loader.
        This is mostly used as a property so, when attempting to override this
        value, you may want to check out :attr:`default_item_class` first.

    .. attribute:: context

        The currently active :ref:`Context <loaders-context>` of this Item Loader.

    .. attribute:: default_item_class

        An :ref:`item <topics-items>` class (or factory), used to instantiate
        items when not given in the ``__init__`` method.

    .. attribute:: default_input_processor

        The default input processor to use for those fields which don't specify
        one.

    .. attribute:: default_output_processor

        The default output processor to use for those fields which don't specify
        one.

    .. attribute:: default_selector_class

        The class used to construct the :attr:`selector` of this
        :class:`ItemLoader`, if only a response is given in the ``__init__`` method.
        If a selector is given in the ``__init__`` method this attribute is ignored.
        This attribute is sometimes overridden in subclasses.

    .. attribute:: selector

        The :class:`~scrapy.selector.Selector` object to extract data from.
        It's either the selector given in the ``__init__`` method or one created from
        the response given in the ``__init__`` method using the
        :attr:`default_selector_class`. This attribute is meant to be
        read-only.
    """

    default_item_class = Item
    default_selector_class = Selector

    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        if selector is None and response is not None:
            try:
                selector = self.default_selector_class(response)
            except AttributeError:
                selector = None
        context.update(response=response)
        super().__init__(item=item, selector=selector, parent=parent, **context)
```

## `itemadapter.adapter`

```python
import dataclasses
from abc import abstractmethod, ABCMeta
from collections import deque
from collections.abc import KeysView, MutableMapping
from types import MappingProxyType
from typing import Any, Deque, Iterator, Type, Optional, List

from itemadapter.utils import (
    _get_pydantic_model_metadata,
    _is_attrs_class,
    _is_pydantic_model,
)

from itemadapter._imports import attr, _scrapy_item_classes


__all__ = [
    "AdapterInterface",
    "AttrsAdapter",
    "DataclassAdapter",
    "DictAdapter",
    "ItemAdapter",
    "PydanticAdapter",
    "ScrapyItemAdapter",
]


class AdapterInterface(MutableMapping, metaclass=ABCMeta):
    """Abstract Base Class for adapters.

    An adapter that handles a specific type of item should inherit from this
    class and implement the abstract methods defined here, plus the
    abtract methods inherited from the MutableMapping base class.
    """

    def __init__(self, item: Any) -> None:
        self.item = item

    @classmethod
    @abstractmethod
    def is_item_class(cls, item_class: type) -> bool:
        """Return True if the adapter can handle the given item class, False otherwise."""
        raise NotImplementedError()

    @classmethod
    def is_item(cls, item: Any) -> bool:
        """Return True if the adapter can handle the given item, False otherwise."""
        return cls.is_item_class(item.__class__)

    @classmethod
    def get_field_meta_from_class(cls, item_class: type, field_name: str) -> MappingProxyType:
        return MappingProxyType({})

    @classmethod
    def get_field_names_from_class(cls, item_class: type) -> Optional[List[str]]:
        """Return a list of fields defined for ``item_class``.
        If a class doesn't support fields, None is returned."""
        return None

    def get_field_meta(self, field_name: str) -> MappingProxyType:
        """Return metadata for the given field name, if available."""
        return self.get_field_meta_from_class(self.item.__class__, field_name)

    def field_names(self) -> KeysView:
        """Return a dynamic view of the item's field names."""
        return self.keys()  # type: ignore[return-value]


class _MixinAttrsDataclassAdapter:

    _fields_dict: dict
    item: Any

    def get_field_meta(self, field_name: str) -> MappingProxyType:
        return self._fields_dict[field_name].metadata

    def field_names(self) -> KeysView:
        return KeysView(self._fields_dict)

    def __getitem__(self, field_name: str) -> Any:
        if field_name in self._fields_dict:
            return getattr(self.item, field_name)
        raise KeyError(field_name)

    def __setitem__(self, field_name: str, value: Any) -> None:
        if field_name in self._fields_dict:
            setattr(self.item, field_name, value)
        else:
            raise KeyError(f"{self.item.__class__.__name__} does not support field: {field_name}")

    def __delitem__(self, field_name: str) -> None:
        if field_name in self._fields_dict:
            try:
                delattr(self.item, field_name)
            except AttributeError:
                raise KeyError(field_name)
        else:
            raise KeyError(f"{self.item.__class__.__name__} does not support field: {field_name}")

    def __iter__(self) -> Iterator:
        return iter(attr for attr in self._fields_dict if hasattr(self.item, attr))

    def __len__(self) -> int:
        return len(list(iter(self)))


class AttrsAdapter(_MixinAttrsDataclassAdapter, AdapterInterface):
    def __init__(self, item: Any) -> None:
        super().__init__(item)
        if attr is None:
            raise RuntimeError("attr module is not available")
        # store a reference to the item's fields to avoid O(n) lookups and O(n^2) traversals
        self._fields_dict = attr.fields_dict(self.item.__class__)

    @classmethod
    def is_item(cls, item: Any) -> bool:
        return _is_attrs_class(item) and not isinstance(item, type)

    @classmethod
    def is_item_class(cls, item_class: type) -> bool:
        return _is_attrs_class(item_class)

    @classmethod
    def get_field_meta_from_class(cls, item_class: type, field_name: str) -> MappingProxyType:
        if attr is None:
            raise RuntimeError("attr module is not available")
        try:
            return attr.fields_dict(item_class)[field_name].metadata  # type: ignore
        except KeyError:
            raise KeyError(f"{item_class.__name__} does not support field: {field_name}")

    @classmethod
    def get_field_names_from_class(cls, item_class: type) -> Optional[List[str]]:
        if attr is None:
            raise RuntimeError("attr module is not available")
        return [a.name for a in attr.fields(item_class)]


class DataclassAdapter(_MixinAttrsDataclassAdapter, AdapterInterface):
    def __init__(self, item: Any) -> None:
        super().__init__(item)
        # store a reference to the item's fields to avoid O(n) lookups and O(n^2) traversals
        self._fields_dict = {field.name: field for field in dataclasses.fields(self.item)}

    @classmethod
    def is_item(cls, item: Any) -> bool:
        return dataclasses.is_dataclass(item) and not isinstance(item, type)

    @classmethod
    def is_item_class(cls, item_class: type) -> bool:
        return dataclasses.is_dataclass(item_class)

    @classmethod
    def get_field_meta_from_class(cls, item_class: type, field_name: str) -> MappingProxyType:
        for field in dataclasses.fields(item_class):
            if field.name == field_name:
                return field.metadata  # type: ignore
        raise KeyError(f"{item_class.__name__} does not support field: {field_name}")

    @classmethod
    def get_field_names_from_class(cls, item_class: type) -> Optional[List[str]]:
        return [a.name for a in dataclasses.fields(item_class)]


class PydanticAdapter(AdapterInterface):

    item: Any

    @classmethod
    def is_item_class(cls, item_class: type) -> bool:
        return _is_pydantic_model(item_class)

    @classmethod
    def get_field_meta_from_class(cls, item_class: type, field_name: str) -> MappingProxyType:
        try:
            return _get_pydantic_model_metadata(item_class, field_name)
        except KeyError:
            raise KeyError(f"{item_class.__name__} does not support field: {field_name}")

    @classmethod
    def get_field_names_from_class(cls, item_class: type) -> Optional[List[str]]:
        return list(item_class.__fields__.keys())  # type: ignore[attr-defined]

    def field_names(self) -> KeysView:
        return KeysView(self.item.__fields__)

    def __getitem__(self, field_name: str) -> Any:
        if field_name in self.item.__fields__:
            return getattr(self.item, field_name)
        raise KeyError(field_name)

    def __setitem__(self, field_name: str, value: Any) -> None:
        if field_name in self.item.__fields__:
            setattr(self.item, field_name, value)
        else:
            raise KeyError(f"{self.item.__class__.__name__} does not support field: {field_name}")

    def __delitem__(self, field_name: str) -> None:
        if field_name in self.item.__fields__:
            try:
                delattr(self.item, field_name)
            except AttributeError:
                raise KeyError(field_name)
        else:
            raise KeyError(f"{self.item.__class__.__name__} does not support field: {field_name}")

    def __iter__(self) -> Iterator:
        return iter(attr for attr in self.item.__fields__ if hasattr(self.item, attr))

    def __len__(self) -> int:
        return len(list(iter(self)))


class _MixinDictScrapyItemAdapter:

    _fields_dict: dict
    item: Any

    def __getitem__(self, field_name: str) -> Any:
        return self.item[field_name]

    def __setitem__(self, field_name: str, value: Any) -> None:
        self.item[field_name] = value

    def __delitem__(self, field_name: str) -> None:
        del self.item[field_name]

    def __iter__(self) -> Iterator:
        return iter(self.item)

    def __len__(self) -> int:
        return len(self.item)


class DictAdapter(_MixinDictScrapyItemAdapter, AdapterInterface):
    @classmethod
    def is_item(cls, item: Any) -> bool:
        return isinstance(item, dict)

    @classmethod
    def is_item_class(cls, item_class: type) -> bool:
        return issubclass(item_class, dict)

    def field_names(self) -> KeysView:
        return KeysView(self.item)


class ScrapyItemAdapter(_MixinDictScrapyItemAdapter, AdapterInterface):
    @classmethod
    def is_item(cls, item: Any) -> bool:
        return isinstance(item, _scrapy_item_classes)

    @classmethod
    def is_item_class(cls, item_class: type) -> bool:
        return issubclass(item_class, _scrapy_item_classes)

    @classmethod
    def get_field_meta_from_class(cls, item_class: type, field_name: str) -> MappingProxyType:
        return MappingProxyType(item_class.fields[field_name])  # type: ignore[attr-defined]

    @classmethod
    def get_field_names_from_class(cls, item_class: type) -> Optional[List[str]]:
        return list(item_class.fields.keys())  # type: ignore[attr-defined]

    def field_names(self) -> KeysView:
        return KeysView(self.item.fields)


class ItemAdapter(MutableMapping):
    """Wrapper class to interact with data container objects. It provides a common interface
    to extract and set data without having to take the object's type into account.
    """

    ADAPTER_CLASSES: Deque[Type[AdapterInterface]] = deque(
        [
            ScrapyItemAdapter,
            DictAdapter,
            DataclassAdapter,
            AttrsAdapter,
            PydanticAdapter,
        ]
    )

    def __init__(self, item: Any) -> None:
        for cls in self.ADAPTER_CLASSES:
            if cls.is_item(item):
                self.adapter = cls(item)
                break
        else:
            raise TypeError(f"No adapter found for objects of type: {type(item)} ({item})")

    @classmethod
    def is_item(cls, item: Any) -> bool:
        for adapter_class in cls.ADAPTER_CLASSES:
            if adapter_class.is_item(item):
                return True
        return False

    @classmethod
    def is_item_class(cls, item_class: type) -> bool:
        for adapter_class in cls.ADAPTER_CLASSES:
            if adapter_class.is_item_class(item_class):
                return True
        return False

    @classmethod
    def _get_adapter_class(cls, item_class: type) -> Type[AdapterInterface]:
        for adapter_class in cls.ADAPTER_CLASSES:
            if adapter_class.is_item_class(item_class):
                return adapter_class
        raise TypeError(f"{item_class} is not a valid item class")

    @classmethod
    def get_field_meta_from_class(cls, item_class: type, field_name: str) -> MappingProxyType:
        adapter_class = cls._get_adapter_class(item_class)
        return adapter_class.get_field_meta_from_class(item_class, field_name)

    @classmethod
    def get_field_names_from_class(cls, item_class: type) -> Optional[List[str]]:
        adapter_class = cls._get_adapter_class(item_class)
        return adapter_class.get_field_names_from_class(item_class)

    @property
    def item(self) -> Any:
        return self.adapter.item

    def __repr__(self) -> str:
        values = ", ".join([f"{key}={value!r}" for key, value in self.items()])
        return f"<{self.__class__.__name__} for {self.item.__class__.__name__}({values})>"

    def __getitem__(self, field_name: str) -> Any:
        return self.adapter.__getitem__(field_name)

    def __setitem__(self, field_name: str, value: Any) -> None:
        self.adapter.__setitem__(field_name, value)

    def __delitem__(self, field_name: str) -> None:
        self.adapter.__delitem__(field_name)

    def __iter__(self) -> Iterator:
        return self.adapter.__iter__()

    def __len__(self) -> int:
        return self.adapter.__len__()

    def get_field_meta(self, field_name: str) -> MappingProxyType:
        """Return metadata for the given field name."""
        return self.adapter.get_field_meta(field_name)

    def field_names(self) -> KeysView:
        """Return read-only key view with the names of all the defined fields for the item."""
        return self.adapter.field_names()

    def asdict(self) -> dict:
        """Return a dict object with the contents of the adapter. This works slightly different
        than calling `dict(adapter)`: it's applied recursively to nested items (if there are any).
        """
        return {key: self._asdict(value) for key, value in self.items()}

    @classmethod
    def _asdict(cls, obj: Any) -> Any:
        if isinstance(obj, dict):
            return {key: cls._asdict(value) for key, value in obj.items()}
        if isinstance(obj, (list, set, tuple)):
            return obj.__class__(cls._asdict(x) for x in obj)
        if isinstance(obj, cls):
            return obj.asdict()
        if cls.is_item(obj):
            return cls(obj).asdict()
        return obj
```

## `itemloaders`

```python
"""
Item Loader

See documentation in docs/topics/loaders.rst
"""
from contextlib import suppress

from itemadapter import ItemAdapter
from parsel.utils import extract_regex, flatten

from itemloaders.common import wrap_loader_context
from itemloaders.processors import Identity
from itemloaders.utils import arg_to_iter


def unbound_method(method):
    """
    Allow to use single-argument functions as input or output processors
    (no need to define an unused first 'self' argument)
    """
    with suppress(AttributeError):
        if "." not in method.__qualname__:
            return method.__func__
    return method


class ItemLoader:
    """
    Return a new Item Loader for populating the given item. If no item is
    given, one is instantiated automatically using the class in
    :attr:`default_item_class`.

    When instantiated with a ``selector`` parameter the :class:`ItemLoader` class
    provides convenient mechanisms for extracting data from web pages
    using parsel_ selectors.

    :param item: The item instance to populate using subsequent calls to
        :meth:`~ItemLoader.add_xpath`, :meth:`~ItemLoader.add_css`,
        :meth:`~ItemLoader.add_jmes` or :meth:`~ItemLoader.add_value`.
    :type item: :class:`dict` object

    :param selector: The selector to extract data from, when using the
        :meth:`add_xpath` (resp. :meth:`add_css`, :meth:`add_jmes`) or :meth:`replace_xpath`
        (resp. :meth:`replace_css`, :meth:`replace_jmes`) method.
    :type selector: :class:`~parsel.selector.Selector` object

    The item, selector and the remaining keyword arguments are
    assigned to the Loader context (accessible through the :attr:`context` attribute).

    .. attribute:: item

        The item object being parsed by this Item Loader.
        This is mostly used as a property so when attempting to override this
        value, you may want to check out :attr:`default_item_class` first.

    .. attribute:: context

        The currently active :ref:`Context <loaders-context>` of this Item Loader.
        Refer to :ref:`loaders-context` for more information about the Loader Context.

    .. attribute:: default_item_class

        An Item class (or factory), used to instantiate items when not given in
        the ``__init__`` method.

        .. warning:: Currently, this factory/class needs to be
            callable/instantiated without any arguments.
            If you are using ``dataclasses``, please consider the following
            alternative::

                from dataclasses import dataclass, field
                from typing import Optional

                @dataclass
                class Product:
                    name: Optional[str] = field(default=None)
                    price: Optional[float] = field(default=None)

    .. attribute:: default_input_processor

        The default input processor to use for those fields which don't specify
        one.

    .. attribute:: default_output_processor

        The default output processor to use for those fields which don't specify
        one.

    .. attribute:: selector

        The :class:`~parsel.selector.Selector` object to extract data from.
        It's the selector given in the ``__init__`` method.
        This attribute is meant to be read-only.

    .. _parsel: https://parsel.readthedocs.io/en/latest/
    """

    default_item_class = dict
    default_input_processor = Identity()
    default_output_processor = Identity()

    def __init__(self, item=None, selector=None, parent=None, **context):
        self.selector = selector
        context.update(selector=selector)
        if item is None:
            item = self.default_item_class()
        self._local_item = item
        context["item"] = item
        self.context = context
        self.parent = parent
        self._local_values = {}
        # values from initial item
        for field_name, value in ItemAdapter(item).items():
            self._values.setdefault(field_name, [])
            self._values[field_name] += arg_to_iter(value)

    @property
    def _values(self):
        if self.parent is not None:
            return self.parent._values
        else:
            return self._local_values

    @property
    def item(self):
        if self.parent is not None:
            return self.parent.item
        else:
            return self._local_item

    def nested_xpath(self, xpath, **context):
        """
        Create a nested loader with an xpath selector.
        The supplied selector is applied relative to selector associated
        with this :class:`ItemLoader`. The nested loader shares the item
        with the parent :class:`ItemLoader` so calls to :meth:`add_xpath`,
        :meth:`add_value`, :meth:`replace_value`, etc. will behave as expected.
        """
        selector = self.selector.xpath(xpath)
        context.update(selector=selector)
        subloader = self.__class__(item=self.item, parent=self, **context)
        return subloader


    def nested_css(self, css, **context):
        """
        Create a nested loader with a css selector.
        The supplied selector is applied relative to selector associated
        with this :class:`ItemLoader`. The nested loader shares the item
        with the parent :class:`ItemLoader` so calls to :meth:`add_xpath`,
        :meth:`add_value`, :meth:`replace_value`, etc. will behave as expected.
        """
        selector = self.selector.css(css)
        context.update(selector=selector)
        subloader = self.__class__(item=self.item, parent=self, **context)
        return subloader


    def add_value(self, field_name, value, *processors, re=None, **kw):
        """
        Process and then add the given ``value`` for the given field.

        The value is first passed through :meth:`get_value` by giving the
        ``processors`` and ``kwargs``, and then passed through the
        :ref:`field input processor <processors>` and its result
        appended to the data collected for that field. If the field already
        contains collected data, the new data is added.

        The given ``field_name`` can be ``None``, in which case values for
        multiple fields may be added. In that case, the processed value should
        be a dict mapping field names to values.

        Examples::

            loader.add_value('name', 'Color TV')
            loader.add_value('colours', ['white', 'blue'])
            loader.add_value('length', '100')
            loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
            loader.add_value(None, {'name': 'foo', 'sex': 'male'})
        """
        value = self.get_value(value, *processors, re=re, **kw)
        if value is None:
            return
        if not field_name:
            for k, v in value.items():
                self._add_value(k, v)
        else:
            self._add_value(field_name, value)


    def replace_value(self, field_name, value, *processors, re=None, **kw):
        """
        Similar to :meth:`add_value` but replaces the collected data with the
        new value instead of adding it.
        """
        value = self.get_value(value, *processors, re=re, **kw)
        if value is None:
            return
        if not field_name:
            for k, v in value.items():
                self._replace_value(k, v)
        else:
            self._replace_value(field_name, value)


    def _add_value(self, field_name, value):
        value = arg_to_iter(value)
        processed_value = self._process_input_value(field_name, value)
        if processed_value:
            self._values.setdefault(field_name, [])
            self._values[field_name] += arg_to_iter(processed_value)

    def _replace_value(self, field_name, value):
        self._values.pop(field_name, None)
        self._add_value(field_name, value)

    def get_value(self, value, *processors, re=None, **kw):
        """
        Process the given ``value`` by the given ``processors`` and keyword
        arguments.

        Available keyword arguments:

        :param re: a regular expression to use for extracting data from the
            given value using the :func:`~parsel.utils.extract_regex` function,
            applied before processors
        :type re: str or typing.Pattern

        Examples:

        >>> from itemloaders import ItemLoader
        >>> from itemloaders.processors import TakeFirst
        >>> loader = ItemLoader()
        >>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
        'FOO'
        """
        if re:
            value = arg_to_iter(value)
            value = flatten(extract_regex(re, x) for x in value)

        for proc in processors:
            if value is None:
                break
            _proc = proc
            proc = wrap_loader_context(proc, self.context)
            try:
                value = proc(value)
            except Exception as e:
                raise ValueError(
                    "Error with processor %s value=%r error='%s: %s'"
                    % (_proc.__class__.__name__, value, type(e).__name__, str(e))
                ) from e
        return value


    def load_item(self):
        """
        Populate the item with the data collected so far, and return it. The
        data collected is first passed through the :ref:`output processors
        <processors>` to get the final value to assign to each item field.
        """
        adapter = ItemAdapter(self.item)
        for field_name in tuple(self._values):
            value = self.get_output_value(field_name)
            if value is not None:
                adapter[field_name] = value

        return adapter.item


    def get_output_value(self, field_name):
        """
        Return the collected values parsed using the output processor, for the
        given field. This method doesn't populate or modify the item at all.
        """
        proc = self.get_output_processor(field_name)
        proc = wrap_loader_context(proc, self.context)
        value = self._values.get(field_name, [])
        try:
            return proc(value)
        except Exception as e:
            raise ValueError(
                "Error with output processor: field=%r value=%r error='%s: %s'"
                % (field_name, value, type(e).__name__, str(e))
            ) from e


    def get_collected_values(self, field_name):
        """Return the collected values for the given field."""
        return self._values.get(field_name, [])


    def get_input_processor(self, field_name):
        proc = getattr(self, "%s_in" % field_name, None)
        if not proc:
            proc = self._get_item_field_attr(
                field_name, "input_processor", self.default_input_processor
            )
        return unbound_method(proc)

    def get_output_processor(self, field_name):
        proc = getattr(self, "%s_out" % field_name, None)
        if not proc:
            proc = self._get_item_field_attr(
                field_name, "output_processor", self.default_output_processor
            )
        return unbound_method(proc)

    def _get_item_field_attr(self, field_name, key, default=None):
        field_meta = ItemAdapter(self.item).get_field_meta(field_name)
        return field_meta.get(key, default)

    def _process_input_value(self, field_name, value):
        proc = self.get_input_processor(field_name)
        _proc = proc
        proc = wrap_loader_context(proc, self.context)
        try:
            return proc(value)
        except Exception as e:
            raise ValueError(
                "Error with input processor %s: field=%r value=%r "
                "error='%s: %s'"
                % (
                    _proc.__class__.__name__,
                    field_name,
                    value,
                    type(e).__name__,
                    str(e),
                )
            ) from e

    def _check_selector_method(self):
        if self.selector is None:
            raise RuntimeError(
                "To use XPath or CSS selectors, %s "
                "must be instantiated with a selector" % self.__class__.__name__
            )

    def add_xpath(self, field_name, xpath, *processors, re=None, **kw):
        """
        Similar to :meth:`ItemLoader.add_value` but receives an XPath instead of a
        value, which is used to extract a list of strings from the
        selector associated with this :class:`ItemLoader`.

        See :meth:`get_xpath` for ``kwargs``.

        :param xpath: the XPath to extract data from
        :type xpath: str

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.add_xpath('name', '//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

        """
        values = self._get_xpathvalues(xpath, **kw)
        self.add_value(field_name, values, *processors, re=re, **kw)


    def replace_xpath(self, field_name, xpath, *processors, re=None, **kw):
        """
        Similar to :meth:`add_xpath` but replaces collected data instead of adding it.
        """
        values = self._get_xpathvalues(xpath, **kw)
        self.replace_value(field_name, values, *processors, re=re, **kw)


    def get_xpath(self, xpath, *processors, re=None, **kw):
        """
        Similar to :meth:`ItemLoader.get_value` but receives an XPath instead of a
        value, which is used to extract a list of unicode strings from the
        selector associated with this :class:`ItemLoader`.

        :param xpath: the XPath to extract data from
        :type xpath: str

        :param re: a regular expression to use for extracting data from the
            selected XPath region
        :type re: str or typing.Pattern

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.get_xpath('//p[@class="product-name"]')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')

        """
        values = self._get_xpathvalues(xpath, **kw)
        return self.get_value(values, *processors, re=re, **kw)


    def _get_xpathvalues(self, xpaths, **kw):
        self._check_selector_method()
        xpaths = arg_to_iter(xpaths)
        return flatten(self.selector.xpath(xpath, **kw).getall() for xpath in xpaths)

    def add_css(self, field_name, css, *processors, re=None, **kw):
        """
        Similar to :meth:`ItemLoader.add_value` but receives a CSS selector
        instead of a value, which is used to extract a list of unicode strings
        from the selector associated with this :class:`ItemLoader`.

        See :meth:`get_css` for ``kwargs``.

        :param css: the CSS selector to extract data from
        :type css: str

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.add_css('name', 'p.product-name')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.add_css('price', 'p#price', re='the price is (.*)')
        """
        values = self._get_cssvalues(css)
        self.add_value(field_name, values, *processors, re=re, **kw)


    def replace_css(self, field_name, css, *processors, re=None, **kw):
        """
        Similar to :meth:`add_css` but replaces collected data instead of adding it.
        """
        values = self._get_cssvalues(css)
        self.replace_value(field_name, values, *processors, re=re, **kw)


    def get_css(self, css, *processors, re=None, **kw):
        """
        Similar to :meth:`ItemLoader.get_value` but receives a CSS selector
        instead of a value, which is used to extract a list of unicode strings
        from the selector associated with this :class:`ItemLoader`.

        :param css: the CSS selector to extract data from
        :type css: str

        :param re: a regular expression to use for extracting data from the
            selected CSS region
        :type re: str or typing.Pattern

        Examples::

            # HTML snippet: <p class="product-name">Color TV</p>
            loader.get_css('p.product-name')
            # HTML snippet: <p id="price">the price is $1200</p>
            loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
        """
        values = self._get_cssvalues(css)
        return self.get_value(values, *processors, re=re, **kw)


    def _get_cssvalues(self, csss):
        self._check_selector_method()
        csss = arg_to_iter(csss)
        return flatten(self.selector.css(css).getall() for css in csss)

    def add_jmes(self, field_name, jmes, *processors, re=None, **kw):
        """
        Similar to :meth:`ItemLoader.add_value` but receives a JMESPath selector
        instead of a value, which is used to extract a list of unicode strings
        from the selector associated with this :class:`ItemLoader`.

        See :meth:`get_jmes` for ``kwargs``.

        :param jmes: the JMESPath selector to extract data from
        :type jmes: str

        Examples::

            # JSON snippet: {"name": "Color TV"}
            loader.add_jmes('name', 'name')
            # JSON snippet: {"price": "the price is $1200"}
            loader.add_jmes('price', 'price', TakeFirst(), re='the price is (.*)')
        """
        values = self._get_jmesvalues(jmes)
        self.add_value(field_name, values, *processors, re=re, **kw)


    def replace_jmes(self, field_name, jmes, *processors, re=None, **kw):
        """
        Similar to :meth:`add_jmes` but replaces collected data instead of adding it.
        """
        values = self._get_jmesvalues(jmes)
        self.replace_value(field_name, values, *processors, re=re, **kw)


    def get_jmes(self, jmes, *processors, re=None, **kw):
        """
        Similar to :meth:`ItemLoader.get_value` but receives a JMESPath selector
        instead of a value, which is used to extract a list of unicode strings
        from the selector associated with this :class:`ItemLoader`.

        :param jmes: the JMESPath selector to extract data from
        :type jmes: str

        :param re: a regular expression to use for extracting data from the
            selected JMESPath
        :type re: str or typing.Pattern

        Examples::

            # JSON snippet: {"name": "Color TV"}
            loader.get_jmes('name')
            # JSON snippet: {"price": "the price is $1200"}
            loader.get_jmes('price', TakeFirst(), re='the price is (.*)')
        """
        values = self._get_jmesvalues(jmes)
        return self.get_value(values, *processors, re=re, **kw)


    def _get_jmesvalues(self, jmess):
        self._check_selector_method()
        jmess = arg_to_iter(jmess)
        if not hasattr(self.selector, "jmespath"):
            raise AttributeError(
                "Please install parsel >= 1.8.1 to get jmespath support"
            )
        return flatten(self.selector.jmespath(jmes).getall() for jmes in jmess)
```
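
The collect-then-load flow of `ItemLoader` (input processor on every `add_value`, output processor once in `load_item`) can be sketched in plain Python. `MiniLoader` below is a hypothetical, heavily simplified stand-in written for illustration. It has no selectors, no loader context and no nested loaders, and it only approximates `arg_to_iter` by wrapping non-list values in a list.

```python
class MiniLoader:
    """Hypothetical stand-in for ItemLoader: per-field lists plus processors."""

    def __init__(self, input_processors=None, output_processors=None):
        self._values = {}  # field name -> list of collected values
        self.input_processors = input_processors or {}
        self.output_processors = output_processors or {}

    def add_value(self, field_name, value):
        # Simplified arg_to_iter: wrap scalars so processors always see a list.
        values = value if isinstance(value, (list, tuple)) else [value]
        proc = self.input_processors.get(field_name, lambda vs: vs)
        processed = proc(values)
        if processed:
            # New data is appended to whatever was collected before.
            self._values.setdefault(field_name, [])
            self._values[field_name] += list(processed)

    def replace_value(self, field_name, value):
        # Drop the collected data first, then add as usual.
        self._values.pop(field_name, None)
        self.add_value(field_name, value)

    def load_item(self):
        item = {}
        for field_name, collected in self._values.items():
            proc = self.output_processors.get(field_name, lambda vs: vs)
            value = proc(collected)
            if value is not None:
                item[field_name] = value
        return item


loader = MiniLoader(
    input_processors={"name": lambda vs: [v.strip() for v in vs]},
    output_processors={"name": " ".join},
)
loader.add_value("name", " Color ")
loader.add_value("name", ["TV "])
loader.add_value("colours", ["white", "blue"])
loader.replace_value("colours", "black")
item = loader.load_item()
```

After these calls `item` is `{"name": "Color TV", "colours": ["black"]}`: the two `name` values were stripped on input and joined on output, while `colours` fell back to identity processing and was overwritten by `replace_value`.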

## `itemloaders.processors`

```python
"""
This module provides some commonly used processors for Item Loaders.

See documentation in docs/topics/loaders.rst
"""
from collections import ChainMap

from itemloaders.common import wrap_loader_context
from itemloaders.utils import arg_to_iter


class MapCompose:
    """
    A processor which is constructed from the composition of the given
    functions, similar to the :class:`Compose` processor. The difference with
    this processor is the way internal results are passed among functions,
    which is as follows:

    The input value of this processor is *iterated* and the first function is
    applied to each element. The results of these function calls (one for each element)
    are concatenated to construct a new iterable, which is then used to apply the
    second function, and so on, until the last function is applied to each
    value of the list of values collected so far. The output values of the last
    function are concatenated together to produce the output of this processor.

    Each particular function can return a value or a list of values, which is
    flattened with the list of values returned by the same function applied to
    the other input values. The functions can also return ``None`` in which
    case the output of that function is ignored for further processing over the
    chain.

    This processor provides a convenient way to compose functions that only
    work with single values (instead of iterables). For this reason the
    :class:`MapCompose` processor is typically used as input processor, since
    data is often extracted using the
    :meth:`~parsel.selector.Selector.extract` method of `parsel selectors`_,
    which returns a list of unicode strings.

    The example below should clarify how it works:

    >>> def filter_world(x):
    ...     return None if x == 'world' else x
    ...
    >>> from itemloaders.processors import MapCompose
    >>> proc = MapCompose(filter_world, str.upper)
    >>> proc(['hello', 'world', 'this', 'is', 'something'])
    ['HELLO', 'THIS', 'IS', 'SOMETHING']

    As with the Compose processor, functions can receive Loader contexts, and
    ``__init__`` method keyword arguments are used as default context values.
    See :class:`Compose` processor for more info.

    .. _`parsel selectors`: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.selector.Selector.extract
    """  # noqa

    def __init__(self, *functions, **default_loader_context):
        self.functions = functions
        self.default_loader_context = default_loader_context

    def __call__(self, value, loader_context=None):
        values = arg_to_iter(value)
        if loader_context:
            context = ChainMap(loader_context, self.default_loader_context)
        else:
            context = self.default_loader_context
        wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
        for func in wrapped_funcs:
            next_values = []
            for v in values:
                try:
                    next_values += arg_to_iter(func(v))
                except Exception as e:
                    raise ValueError(
                        "Error in MapCompose with "
                        "%s value=%r error='%s: %s'"
                        % (str(func), value, type(e).__name__, str(e))
                    ) from e
            values = next_values
        return values



class Compose:
    """
    A processor which is constructed from the composition of the given
    functions. This means that each input value of this processor is passed to
    the first function, and the result of that function is passed to the second
    function, and so on, until the last function returns the output value of
    this processor.

    By default, processing stops on a ``None`` value. This behaviour can be changed by
    passing keyword argument ``stop_on_none=False``.

    Example:

    >>> from itemloaders.processors import Compose
    >>> proc = Compose(lambda v: v[0], str.upper)
    >>> proc(['hello', 'world'])
    'HELLO'

    Each function can optionally receive a ``loader_context`` parameter. For
    those which do, this processor will pass the currently active :ref:`Loader
    context <loaders-context>` through that parameter.

    The keyword arguments passed in the ``__init__`` method are used as the default
    Loader context values passed to each function call. However, the final
    Loader context values passed to functions are overridden with the currently
    active Loader context accessible through the :attr:`ItemLoader.context
    <itemloaders.ItemLoader.context>` attribute.
    """

    def __init__(self, *functions, **default_loader_context):
        self.functions = functions
        self.stop_on_none = default_loader_context.get("stop_on_none", True)
        self.default_loader_context = default_loader_context

    def __call__(self, value, loader_context=None):
        if loader_context:
            context = ChainMap(loader_context, self.default_loader_context)
        else:
            context = self.default_loader_context
        wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
        for func in wrapped_funcs:
            if value is None and self.stop_on_none:
                break
            try:
                value = func(value)
            except Exception as e:
                raise ValueError(
                    "Error in Compose with "
                    "%s value=%r error='%s: %s'"
                    % (str(func), value, type(e).__name__, str(e))
                ) from e
        return value



class TakeFirst:
    """
    Returns the first non-null/non-empty value from the values received,
    so it's typically used as an output processor to single-valued fields.
    It doesn't receive any ``__init__`` method arguments, nor does it accept Loader contexts.

    Example:

    >>> from itemloaders.processors import TakeFirst
    >>> proc = TakeFirst()
    >>> proc(['', 'one', 'two', 'three'])
    'one'
    """

    def __call__(self, values):
        for value in values:
            if value is not None and value != "":
                return value



class Identity:
    """
    The simplest processor, which doesn't do anything. It returns the original
    values unchanged. It doesn't receive any ``__init__`` method arguments, nor does it
    accept Loader contexts.

    Example:

    >>> from itemloaders.processors import Identity
    >>> proc = Identity()
    >>> proc(['one', 'two', 'three'])
    ['one', 'two', 'three']
    """

    def __call__(self, values):
        return values



class SelectJmes:
    """
    Query the input data structure for the JMESPath expression given at instantiation and return the result.
    Requires: jmespath (https://github.com/jmespath/jmespath)
    Note: SelectJmes accepts only one input element at a time.

    Example:

    >>> from itemloaders.processors import SelectJmes, Compose, MapCompose
    >>> proc = SelectJmes("foo")  # for direct use on lists and dictionaries
    >>> proc({'foo': 'bar'})
    'bar'
    >>> proc({'foo': {'bar': 'baz'}})
    {'bar': 'baz'}

    Working with JSON:

    >>> import json
    >>> proc_single_json_str = Compose(json.loads, SelectJmes("foo"))
    >>> proc_single_json_str('{"foo": "bar"}')
    'bar'
    >>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
    >>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
    ['bar']
    """

    def __init__(self, json_path):
        self.json_path = json_path
        import jmespath

        self.compiled_path = jmespath.compile(self.json_path)

    def __call__(self, value):
        """Query value for the jmespath query and return answer
        :param value: a data structure (dict, list) to extract from
        :return: Element extracted according to jmespath query
        """
        return self.compiled_path.search(value)



class Join:
    """
    Returns the values joined with the separator given in the ``__init__`` method, which
    defaults to ``' '``. It doesn't accept Loader contexts.

    When using the default separator, this processor is equivalent to the
    function: ``' '.join``

    Examples:

    >>> from itemloaders.processors import Join
    >>> proc = Join()
    >>> proc(['one', 'two', 'three'])
    'one two three'
    >>> proc = Join('<br>')
    >>> proc(['one', 'two', 'three'])
    'one<br>two<br>three'
    """

    def __init__(self, separator=" "):
        self.separator = separator

    def __call__(self, values):
        return self.separator.join(values)
```
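
The difference between `MapCompose` (element-wise, flattening, dropping `None`) and `Compose` (whole value passed along, stopping on `None`) can be sketched without the library. The helpers `map_compose`, `compose` and `parse_price` below are hypothetical stand-ins that ignore loader contexts and error wrapping. They are meant only to mirror the control flow of the classes above.

```python
def map_compose(*functions):
    """Hypothetical stand-in for MapCompose: apply each function per element."""
    def proc(values):
        for func in functions:
            next_values = []
            for v in values:
                result = func(v)
                if result is None:
                    continue  # None results are dropped from the chain
                # List results are flattened into the running list of values.
                next_values += result if isinstance(result, list) else [result]
            values = next_values
        return values
    return proc


def compose(*functions):
    """Hypothetical stand-in for Compose: pass the whole value through each function."""
    def proc(value):
        for func in functions:
            if value is None:
                break  # mirrors the default stop_on_none=True behaviour
            value = func(value)
        return value
    return proc


def parse_price(text):
    """Return the integer amount of an 'N USD' string, or None if it does not match."""
    text = text.strip()
    return int(text.split()[0]) if text.endswith("USD") else None


raw = ["  10 USD ", "n/a", " 25 USD"]

per_element = map_compose(parse_price)(raw)          # each string handled on its own
total = compose(map_compose(parse_price), sum)(raw)  # then the whole list is reduced once
```

With this input `per_element` is `[10, 25]` (the unparseable `"n/a"` is silently dropped) and `total` is `35`. This is why element-wise cleaning is usually done in an input processor and a reduction such as `TakeFirst` or `Join` in an output processor.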

## `scrapy.http.request`

```python
"""
This module implements the Request class which is used to represent HTTP
requests in Scrapy.

See documentation in docs/topics/request-response.rst
"""
import inspect
from typing import Callable, List, Optional, Tuple, Type, TypeVar, Union

from w3lib.url import safe_url_string

import scrapy
from scrapy.http.common import obsolete_setter
from scrapy.http.headers import Headers
from scrapy.utils.curl import curl_to_request_kwargs
from scrapy.utils.python import to_bytes
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import escape_ajax

RequestTypeVar = TypeVar("RequestTypeVar", bound="Request")


def NO_CALLBACK(*args, **kwargs):
    """When assigned to the ``callback`` parameter of
    :class:`~scrapy.http.Request`, it indicates that the request is not meant
    to have a spider callback at all.

    For example:

    .. code-block:: python

       Request("https://example.com", callback=NO_CALLBACK)

    This value should be used by :ref:`components <topics-components>` that
    create and handle their own requests, e.g. through
    :meth:`scrapy.core.engine.ExecutionEngine.download`, so that downloader
    middlewares handling such requests can treat them differently from requests
    intended for the :meth:`~scrapy.Spider.parse` callback.
    """
    raise RuntimeError(
        "The NO_CALLBACK callback has been called. This is a special callback "
        "value intended for requests whose callback is never meant to be "
        "called."
    )



class Request(object_ref):
    """Represents an HTTP request, which is usually generated in a Spider and
    executed by the Downloader, thus generating a :class:`Response`.
    """

    attributes: Tuple[str, ...] = (
        "url",
        "callback",
        "method",
        "headers",
        "body",
        "cookies",
        "meta",
        "encoding",
        "priority",
        "dont_filter",
        "errback",
        "flags",
        "cb_kwargs",
    )
    """A tuple of :class:`str` objects containing the name of all public
    attributes of the class that are also keyword parameters of the
    ``__init__`` method.

    Currently used by :meth:`Request.replace`, :meth:`Request.to_dict` and
    :func:`~scrapy.utils.request.request_from_dict`.
    """

    def __init__(
        self,
        url: str,
        callback: Optional[Callable] = None,
        method: str = "GET",
        headers: Optional[dict] = None,
        body: Optional[Union[bytes, str]] = None,
        cookies: Optional[Union[dict, List[dict]]] = None,
        meta: Optional[dict] = None,
        encoding: str = "utf-8",
        priority: int = 0,
        dont_filter: bool = False,
        errback: Optional[Callable] = None,
        flags: Optional[List[str]] = None,
        cb_kwargs: Optional[dict] = None,
    ) -> None:
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        if not isinstance(priority, int):
            raise TypeError(f"Request priority not an integer: {priority!r}")
        self.priority = priority

        if not (callable(callback) or callback is None):
            raise TypeError(
                f"callback must be a callable, got {type(callback).__name__}"
            )
        if not (callable(errback) or errback is None):
            raise TypeError(f"errback must be a callable, got {type(errback).__name__}")
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self._cb_kwargs = dict(cb_kwargs) if cb_kwargs else None
        self.flags = [] if flags is None else list(flags)

    @property
    def cb_kwargs(self) -> dict:
        if self._cb_kwargs is None:
            self._cb_kwargs = {}
        return self._cb_kwargs

    @property
    def meta(self) -> dict:
        if self._meta is None:
            self._meta = {}
        return self._meta

    def _get_url(self) -> str:
        return self._url

    def _set_url(self, url: str) -> None:
        if not isinstance(url, str):
            raise TypeError(f"Request url must be str, got {type(url).__name__}")

        s = safe_url_string(url, self.encoding)
        self._url = escape_ajax(s)

        if (
            "://" not in self._url
            and not self._url.startswith("about:")
            and not self._url.startswith("data:")
        ):
            raise ValueError(f"Missing scheme in request url: {self._url}")

    url = property(_get_url, obsolete_setter(_set_url, "url"))

    def _get_body(self) -> bytes:
        return self._body

    def _set_body(self, body: Optional[Union[str, bytes]]) -> None:
        self._body = b"" if body is None else to_bytes(body, self.encoding)

    body = property(_get_body, obsolete_setter(_set_body, "body"))

    @property
    def encoding(self) -> str:
        return self._encoding

    def __repr__(self) -> str:
        return f"<{self.method} {self.url}>"

    def copy(self) -> "Request":
        return self.replace()


    def replace(self, *args, **kwargs) -> "Request":
        """Create a new Request with the same attributes except for those given new values"""
        for x in self.attributes:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop("cls", self.__class__)
        return cls(*args, **kwargs)


    @classmethod
    def from_curl(
        cls: Type[RequestTypeVar],
        curl_command: str,
        ignore_unknown_options: bool = True,
        **kwargs,
    ) -> RequestTypeVar:
        """Create a Request object from a string containing a `cURL
        <https://curl.haxx.se/>`_ command. It populates the HTTP method, the
        URL, the headers, the cookies and the body. It accepts the same
        arguments as the :class:`Request` class, taking preference and
        overriding the values of the same arguments contained in the cURL
        command.

        Unrecognized options are ignored by default. To raise an error when
        finding unknown options call this method by passing
        ``ignore_unknown_options=False``.

        .. caution:: Using :meth:`from_curl` from :class:`~scrapy.http.Request`
                     subclasses, such as :class:`~scrapy.http.JSONRequest`, or
                     :class:`~scrapy.http.XmlRpcRequest`, as well as having
                     :ref:`downloader middlewares <topics-downloader-middleware>`
                     and
                     :ref:`spider middlewares <topics-spider-middleware>`
                     enabled, such as
                     :class:`~scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware`,
                     :class:`~scrapy.downloadermiddlewares.useragent.UserAgentMiddleware`,
                     or
                     :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`,
                     may modify the :class:`~scrapy.http.Request` object.

        To translate a cURL command into a Scrapy request,
        you may use `curl2scrapy <https://michael-shub.github.io/curl2scrapy/>`_.
        """
        request_kwargs = curl_to_request_kwargs(curl_command, ignore_unknown_options)
        request_kwargs.update(kwargs)
        return cls(**request_kwargs)


    def to_dict(self, *, spider: Optional["scrapy.Spider"] = None) -> dict:
        """Return a dictionary containing the Request's data.

        Use :func:`~scrapy.utils.request.request_from_dict` to convert back into a :class:`~scrapy.Request` object.

        If a spider is given, this method will try to find out the name of the spider methods used as callback
        and errback and include them in the output dict, raising an exception if they cannot be found.
        """
        d = {
            "url": self.url,  # urls are safe (safe_string_url)
            "callback": _find_method(spider, self.callback)
            if callable(self.callback)
            else self.callback,
            "errback": _find_method(spider, self.errback)
            if callable(self.errback)
            else self.errback,
            "headers": dict(self.headers),
        }
        for attr in self.attributes:
            d.setdefault(attr, getattr(self, attr))
        if type(self) is not Request:  # pylint: disable=unidiomatic-typecheck
            d["_class"] = self.__module__ + "." + self.__class__.__name__
        return d



def _find_method(obj, func):
    """Helper function for Request.to_dict"""
    # Only instance methods contain ``__func__``
    if obj and hasattr(func, "__func__"):
        members = inspect.getmembers(obj, predicate=inspect.ismethod)
        for name, obj_func in members:
            # We need to use __func__ to access the original function object because instance
            # method objects are generated each time attribute is retrieved from instance.
            #
            # Reference: The standard type hierarchy
            # https://docs.python.org/3/reference/datamodel.html
            if obj_func.__func__ is func.__func__:
                return name
    raise ValueError(f"Function {func} is not an instance method in: {obj}")
```
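
The comment in `_find_method` says that bound-method objects are created anew on every attribute access, which is why the helper compares `__func__` instead of the methods themselves. The following is a minimal standalone sketch of that behaviour using only the standard library. `DemoSpider` and `find_method_name` are hypothetical names made up for illustration; they are not part of Scrapy.

```python
import inspect


class DemoSpider:
    """Hypothetical stand-in for a spider; not a real scrapy.Spider."""

    def parse(self, response):
        return response

    def on_error(self, failure):
        return failure


def find_method_name(obj, func):
    """Return the attribute name under which bound method `func` lives on `obj`.

    Same idea as the helper above: compare the underlying function objects,
    because each attribute access builds a new bound-method object.
    """
    if obj is not None and hasattr(func, "__func__"):
        for name, member in inspect.getmembers(obj, predicate=inspect.ismethod):
            if member.__func__ is func.__func__:
                return name
    raise ValueError(f"Function {func} is not an instance method in: {obj}")


spider = DemoSpider()

# Two accesses give two distinct bound-method objects...
same_object = spider.parse is spider.parse
# ...but both wrap the same underlying function.
same_function = spider.parse.__func__ is spider.parse.__func__

callback_name = find_method_name(spider, spider.parse)
errback_name = find_method_name(spider, spider.on_error)
```

This is the reason `to_dict` can store a callback as a plain method name: the name can later be looked up again on a spider instance, whereas the bound-method object itself cannot be compared by identity.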

## `scrapy.linkextractors.lxmlhtml`

```python
"""
Link extractor based on lxml.html
"""
import logging
import operator
from functools import partial
from urllib.parse import urljoin, urlparse

from lxml import etree
from parsel.csstranslator import HTMLTranslator
from w3lib.html import strip_html5_whitespace
from w3lib.url import canonicalize_url, safe_url_string

from scrapy.link import Link
from scrapy.linkextractors import (
    IGNORED_EXTENSIONS,
    _is_valid_url,
    _matches,
    _re_type,
    re,
)
from scrapy.utils.misc import arg_to_iter, rel_has_nofollow
from scrapy.utils.python import unique as unique_list
from scrapy.utils.response import get_base_url
from scrapy.utils.url import url_has_any_extension, url_is_from_any_domain

logger = logging.getLogger(__name__)

# from lxml/src/lxml/html/__init__.py
XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"

_collect_string_content = etree.XPath("string()")


def _nons(tag):
    if isinstance(tag, str):
        if tag[0] == "{" and tag[1 : len(XHTML_NAMESPACE) + 1] == XHTML_NAMESPACE:
            return tag.split("}")[-1]
    return tag


def _identity(x):
    return x


def _canonicalize_link_url(link):
    return canonicalize_url(link.url, keep_fragments=True)


class LxmlParserLinkExtractor:
    def __init__(
        self,
        tag="a",
        attr="href",
        process=None,
        unique=False,
        strip=True,
        canonicalized=False,
    ):
        self.scan_tag = tag if callable(tag) else partial(operator.eq, tag)
        self.scan_attr = attr if callable(attr) else partial(operator.eq, attr)
        self.process_attr = process if callable(process) else _identity
        self.unique = unique
        self.strip = strip
        self.link_key = (
            operator.attrgetter("url") if canonicalized else _canonicalize_link_url
        )

    def _iter_links(self, document):
        for el in document.iter(etree.Element):
            if not self.scan_tag(_nons(el.tag)):
                continue
            attribs = el.attrib
            for attrib in attribs:
                if not self.scan_attr(attrib):
                    continue
                yield (el, attrib, attribs[attrib])

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector.root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            try:
                if self.strip:
                    attr_val = strip_html5_whitespace(attr_val)
                attr_val = urljoin(base_url, attr_val)
            except ValueError:
                continue  # skipping bogus links
            else:
                url = self.process_attr(attr_val)
                if url is None:
                    continue
            try:
                url = safe_url_string(url, encoding=response_encoding)
            except ValueError:
                logger.debug(f"Skipping extraction of link with bad URL {url!r}")
                continue

            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(
                url,
                _collect_string_content(el) or "",
                nofollow=rel_has_nofollow(el.get("rel")),
            )
            links.append(link)
        return self._deduplicate_if_needed(links)

    def extract_links(self, response):
        base_url = get_base_url(response)
        return self._extract_links(
            response.selector, response.url, response.encoding, base_url
        )

    def _process_links(self, links):
        """Normalize and filter extracted links

        The subclass should override it if necessary
        """
        return self._deduplicate_if_needed(links)

    def _deduplicate_if_needed(self, links):
        if self.unique:
            return unique_list(links, key=self.link_key)
        return links


class LxmlLinkExtractor:
    _csstranslator = HTMLTranslator()

    def __init__(
        self,
        allow=(),
        deny=(),
        allow_domains=(),
        deny_domains=(),
        restrict_xpaths=(),
        tags=("a", "area"),
        attrs=("href",),
        canonicalize=False,
        unique=True,
        process_value=None,
        deny_extensions=None,
        restrict_css=(),
        strip=True,
        restrict_text=None,
    ):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        self.link_extractor = LxmlParserLinkExtractor(
            tag=partial(operator.contains, tags),
            attr=partial(operator.contains, attrs),
            unique=unique,
            process=process_value,
            strip=strip,
            canonicalized=canonicalize,
        )
        self.allow_res = [
            x if isinstance(x, _re_type) else re.compile(x) for x in arg_to_iter(allow)
        ]
        self.deny_res = [
            x if isinstance(x, _re_type) else re.compile(x) for x in arg_to_iter(deny)
        ]

        self.allow_domains = set(arg_to_iter(allow_domains))
        self.deny_domains = set(arg_to_iter(deny_domains))

        self.restrict_xpaths = tuple(arg_to_iter(restrict_xpaths))
        self.restrict_xpaths += tuple(
            map(self._csstranslator.css_to_xpath, arg_to_iter(restrict_css))
        )

        if deny_extensions is None:
            deny_extensions = IGNORED_EXTENSIONS
        self.canonicalize = canonicalize
        self.deny_extensions = {"." + e for e in arg_to_iter(deny_extensions)}
        self.restrict_text = [
            x if isinstance(x, _re_type) else re.compile(x)
            for x in arg_to_iter(restrict_text)
        ]

    def _link_allowed(self, link):
        if not _is_valid_url(link.url):
            return False
        if self.allow_res and not _matches(link.url, self.allow_res):
            return False
        if self.deny_res and _matches(link.url, self.deny_res):
            return False
        parsed_url = urlparse(link.url)
        if self.allow_domains and not url_is_from_any_domain(
            parsed_url, self.allow_domains
        ):
            return False
        if self.deny_domains and url_is_from_any_domain(parsed_url, self.deny_domains):
            return False
        if self.deny_extensions and url_has_any_extension(
            parsed_url, self.deny_extensions
        ):
            return False
        if self.restrict_text and not _matches(link.text, self.restrict_text):
            return False
        return True

    def matches(self, url):
        if self.allow_domains and not url_is_from_any_domain(url, self.allow_domains):
            return False
        if self.deny_domains and url_is_from_any_domain(url, self.deny_domains):
            return False

        allowed = (
            (regex.search(url) for regex in self.allow_res)
            if self.allow_res
            else [True]
        )
        denied = (regex.search(url) for regex in self.deny_res) if self.deny_res else []
        return any(allowed) and not any(denied)

    def _process_links(self, links):
        links = [x for x in links if self._link_allowed(x)]
        if self.canonicalize:
            for link in links:
                link.url = canonicalize_url(link.url)
        links = self.link_extractor._process_links(links)
        return links

    def _extract_links(self, *args, **kwargs):
        return self.link_extractor._extract_links(*args, **kwargs)

    def extract_links(self, response):
        """Returns a list of :class:`~scrapy.link.Link` objects from the
        specified :class:`response <scrapy.http.Response>`.

        Only links that match the settings passed to the ``__init__`` method of
        the link extractor are returned.

        Duplicate links are omitted if the ``unique`` attribute is set to ``True``,
        otherwise they are returned.
        """
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [
                subdoc for x in self.restrict_xpaths for subdoc in response.xpath(x)
            ]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        if self.link_extractor.unique:
            return unique_list(all_links)
        return all_links
```
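
`LxmlLinkExtractor.matches` combines its patterns as "at least one allow pattern matches and no deny pattern matches", where an empty allow list means "allow everything". Below is a standalone sketch of just that regex step, with the domain checks left out. The function `url_matches` and the sample paths are hypothetical and only serve as illustration.

```python
import re


def url_matches(url, allow=(), deny=()):
    """Sketch of the allow/deny regex step of ``matches`` (domain checks omitted)."""
    allow_res = [re.compile(pattern) for pattern in allow]
    deny_res = [re.compile(pattern) for pattern in deny]
    # No allow patterns means every URL is allowed by default.
    allowed = any(r.search(url) for r in allow_res) if allow_res else True
    denied = any(r.search(url) for r in deny_res)
    return allowed and not denied


# Illustrative filter: keep catalogue pages, skip category listings.
filters = {"allow": [r"/catalogue/"], "deny": [r"/category/"]}

book_page = url_matches(
    "https://books.toscrape.com/catalogue/some-book_1/index.html", **filters
)
category_page = url_matches(
    "https://books.toscrape.com/catalogue/category/books_1/index.html", **filters
)
home_page = url_matches("https://books.toscrape.com/index.html", **filters)
no_filters = url_matches("https://books.toscrape.com/index.html")
```

Note that deny wins over allow: the category URL matches the allow pattern but is still rejected because it also matches a deny pattern.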

## `scrapy.link`

```python
"""
This module defines the Link object used in Link extractors.

For actual link extractors implementation see scrapy.linkextractors, or
its documentation in: docs/topics/link-extractors.rst
"""
from typing import Any


class Link:
    """Link objects represent an extracted link by the LinkExtractor.

    Using the anchor tag sample below to illustrate the parameters::

            <a href="https://example.com/nofollow.html#foo" rel="nofollow">Dont follow this one</a>

    :param url: the absolute url being linked to in the anchor tag.
                From the sample, this is ``https://example.com/nofollow.html``.

    :param text: the text in the anchor tag. From the sample, this is ``Dont follow this one``.

    :param fragment: the part of the url after the hash symbol. From the sample, this is ``foo``.

    :param nofollow: an indication of the presence or absence of a nofollow value in the ``rel`` attribute
                    of the anchor tag.
    """

    __slots__ = ["url", "text", "fragment", "nofollow"]

    def __init__(
        self, url: str, text: str = "", fragment: str = "", nofollow: bool = False
    ):
        if not isinstance(url, str):
            got = url.__class__.__name__
            raise TypeError(f"Link urls must be str objects, got {got}")
        self.url: str = url
        self.text: str = text
        self.fragment: str = fragment
        self.nofollow: bool = nofollow

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, Link):
            raise NotImplementedError
        return (
            self.url == other.url
            and self.text == other.text
            and self.fragment == other.fragment
            and self.nofollow == other.nofollow
        )

    def __hash__(self) -> int:
        return (
            hash(self.url) ^ hash(self.text) ^ hash(self.fragment) ^ hash(self.nofollow)
        )

    def __repr__(self) -> str:
        return (
            f"Link(url={self.url!r}, text={self.text!r}, "
            f"fragment={self.fragment!r}, nofollow={self.nofollow!r})"
        )
```
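
Because `Link` compares and hashes on all four fields, two links to the same URL with different anchor text count as different links. The sketch below shows the effect on de-duplication without importing Scrapy. `LinkLike` is a hypothetical stand-in with the same fields, and `unique` is an assumed equivalent of the `unique_list` helper used by the link extractor, written here as an order-preserving filter.

```python
from typing import NamedTuple


class LinkLike(NamedTuple):
    """Hypothetical stand-in for scrapy.link.Link with the same four fields."""

    url: str
    text: str = ""
    fragment: str = ""
    nofollow: bool = False


def unique(items, key=lambda item: item):
    """Order-preserving de-duplication (assumed behaviour of ``unique_list``)."""
    seen = set()
    result = []
    for item in items:
        marker = key(item)
        if marker not in seen:
            seen.add(marker)
            result.append(item)
    return result


links = [
    LinkLike("https://example.com/a", text="First"),
    LinkLike("https://example.com/a", text="First"),  # exact duplicate
    LinkLike("https://example.com/a", text="Again"),  # same URL, other text
    LinkLike("https://example.com/b", text="Second"),
]

by_all_fields = unique(links)  # compares all four fields
by_url = unique(links, key=lambda link: link.url)  # compares URLs only
```

De-duplicating on the whole object keeps three links, while a URL-based key keeps two and retains the first occurrence of each URL.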

# <b>(FORGET) Additional</b>

|bash|description|
|-|-|
|`scrapy startproject <name>`|start a new scrapy project|
|`scrapy genspider <spider_name> <domain>`|generate a spider in the `spiders` dir|
|`scrapy runspider <spider_file>.py`|run a self-contained spider file (no project needed)|

# 3. Creating a Scrapy project

Work inside a virtual environment and install Scrapy there:

```sh
pip install --upgrade pip
pip install scrapy
```

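A **spider** is the part of a Scrapy project that, like its arachnid namesake, is designed to crawl webs: a class that defines which pages to request and how to parse them. Spiders live inside a project, so start by creating one: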

```sh
$ scrapy startproject test1
```
```
New Scrapy project 'test1', using template directory '/home/commi/venv/venv3.12/lib/python3.12/site-packages/scrapy/templates/project', created in:
    /home/commi/Yandex.Disk/it_learning/08_parsing_data/data/test1

You can start your first spider with:
    cd test1
    scrapy genspider example example.com
```

## Project dir

In [35]:
cd /home/commi/Yandex.Disk/it_learning/08_parsing_data/data/

In [36]:
tree test1

test1
├── scrapy.cfg
└── test1
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

3 directories, 7 files


In [37]:
cat test1/scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = test1.settings

[deploy]
#url = http://localhost:6800/
project = test1
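
`scrapy.cfg` is an INI-style file: its `[settings]` section names the Python module that holds the project's settings. As a rough illustration of the file format only (this is not a description of how Scrapy loads it internally), the content above can be read with the standard library's `configparser`:

```python
from configparser import ConfigParser

# Contents copied from the generated scrapy.cfg shown above.
SCRAPY_CFG = """\
[settings]
default = test1.settings

[deploy]
#url = http://localhost:6800/
project = test1
"""

parser = ConfigParser()
parser.read_string(SCRAPY_CFG)

settings_module = parser.get("settings", "default")
deploy_project = parser.get("deploy", "project")
# The url option is commented out, so it is absent from the parsed result.
has_deploy_url = parser.has_option("deploy", "url")
```

The value `test1.settings` is a dotted module path, which corresponds to the file `test1/test1/settings.py` in the tree above.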


### Deeper

In [38]:
tree test1/test1

test1/test1
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
    └── __init__.py

2 directories, 6 files


In [39]:
cat test1/test1/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Test1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
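
The generated `Test1Item` is an empty placeholder. Scrapy also accepts plain dataclass objects as items, which makes it possible to sketch an item without importing Scrapy at all. In the example below, `BookItem` and its fields are hypothetical, chosen only to fit a book-listing site.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BookItem:
    """Hypothetical item describing one scraped book."""

    title: str
    price: Optional[float] = None
    in_stock: bool = False


book = BookItem(title="A Light in the Attic", price=51.77, in_stock=True)
book_as_dict = asdict(book)
```

With `scrapy.Item`, the equivalent declaration would list each field as `name = scrapy.Field()`, as the comment in the generated template suggests.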


In [40]:
cat test1/test1/middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class Test1SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):

In [41]:
cat test1/test1/pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class Test1Pipeline:
    def process_item(self, item, spider):
        return item
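
The generated pipeline returns every item unchanged. A pipeline usually cleans or validates the item inside `process_item` and then returns it. The sketch below assumes items are plain dicts with an optional `price` key; the class name `PriceToFloatPipeline` and the sample data are hypothetical.

```python
class PriceToFloatPipeline:
    """Hypothetical pipeline: turn a price string such as '£51.77' into a float.

    Assumes items are plain dicts with an optional 'price' key.
    """

    def process_item(self, item, spider):
        price = item.get("price")
        if isinstance(price, str):
            # Keep ASCII digits and the decimal point; drop currency symbols.
            cleaned = "".join(ch for ch in price if ch in "0123456789.")
            item["price"] = float(cleaned) if cleaned else None
        return item


pipeline = PriceToFloatPipeline()
cleaned_item = pipeline.process_item(
    {"title": "A Light in the Attic", "price": "£51.77"}, spider=None
)
untouched_item = pipeline.process_item({"title": "No price"}, spider=None)
```

As the template comment notes, a pipeline only runs once it is listed in the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"test1.pipelines.PriceToFloatPipeline": 300}` in `settings.py`. The number controls the order in which pipelines run, with lower values running first.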


In [42]:
cat test1/test1/settings.py

# Scrapy settings for test1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "test1"

SPIDER_MODULES = ["test1.spiders"]
NEWSPIDER_MODULE = "test1.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "test1 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay settin

### Even deeper

In [43]:
tree test1/test1/spiders

test1/test1/spiders
└── __init__.py

1 directory, 1 file


In [44]:
cat test1/test1/spiders/__init__.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


# 4. Write a Simple Scraper

To create a crawler, add a new spider file to the spiders directory, here `test1/test1/spiders/bookspider.py`. The `scrapy genspider` command generates it from a template:

```sh
$ cd test1/test1/spiders/
$ scrapy genspider bookspider books.toscrape.com
```
```
Created spider 'bookspider' using template 'basic' in module:
  test1.spiders.bookspider
```

In [45]:
tree test1/test1/spiders/

test1/test1/spiders/
├── bookspider.py
├── __init__.py
└── __pycache__
    └── __init__.cpython-312.pyc

2 directories, 3 files


In [46]:
cat test1/test1/spiders/bookspider.py

import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass
