# SCRAPY [DOCUMENTATION](https://docs.scrapy.org/en/latest/index.html)

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

# Basic commands

## Scrapy command line

|Scrapy command line|Description|
|-|-|
|`scrapy --help`|list all available commands (run from the project's directory)|
|**List info**||
|`scrapy list`|list all available spiders (run from the project's directory)|
|**Fetch a webpage in the shell**||
|`scrapy shell <url>`|fetch `<url>` and open an interactive shell with the `response` object|
|(inside shell) `view(response)`|open the response object in your browser|
|**Maintain the project**||
|`scrapy startproject project_name`|create a new project|
|`scrapy crawl spider_name`|run the spider (from the project's top-level directory)|

```
scrapy --help
Scrapy 2.11.0 - active project: tutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
```

## Most common Scrapy extraction tools

|Scrapy extraction tools|Description|
|-|-|
|`view(response)`|open the response page from the shell in your web browser|
|**Response status codes**||
|`response.status`|HTTP status code of the response (e.g. `200`)|
|**CSS selectors**||
|`response.css(query)`|run a CSS query; returns a SelectorList|
|`response.css("title::text").getall()`|get only the text from the SelectorList, as a list of strings|

# <b>1. Installation guide</b>

We strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system packages.

```sh
(venv) $ pip install Scrapy
```

Scrapy is written in pure Python and depends on a few key Python packages (among others):

- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs

Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check [platform-specific guides](https://docs.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes).

In case of any trouble related to these dependencies, please refer to their respective installation instructions:

- [lxml installation](https://lxml.de/installation.html)
- [cryptography installation](https://cryptography.io/en/latest/installation/)
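
To see which of the packages listed above are already present in the active environment, a small standard-library check can help. This is a minimal sketch: the distribution names in `PACKAGES` are taken from the list above, and the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata

# Distribution names as listed above; adjust them if your environment differs.
PACKAGES = ["Scrapy", "lxml", "parsel", "w3lib", "Twisted", "cryptography", "pyOpenSSL"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<14}{version or 'not installed'}")
```

Run it inside the virtualenv; any package reported as not installed points to the installation guides linked above.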

# <b>2. Scrapy tutorial</b>

This tutorial will walk you through these tasks:

- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Exporting the scraped data using the command line
- Changing the spider to recursively follow links
- Using spider arguments

# 2.1 Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

```sh
scrapy startproject tutorial
```
```
New Scrapy project 'tutorial', using template directory '/home/commi/venv/venv3.11/lib/python3.11/site-packages/scrapy/templates/project', created in:
    /home/commi/Yandex.Disk/it_learning/08_web_scraping/02_scrapy/data/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
```

This will create a `tutorial` directory with the following contents:

```
tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py
```

# 2.2 Our first Spider

**Spiders** are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `Spider` and define the initial requests to make and, optionally, how to follow links in the pages and how to parse the downloaded page content to extract data.

This is the code for our first `Spider`. Save it in a file named `quotes_spider.py` under the `tutorial/spiders` directory in your project:

```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
```

As you can see, our Spider subclasses `scrapy.Spider` and defines some attributes and methods:

- `name`: identifies the Spider. It **must be unique within a project**, that is, you can’t set the same name for different Spiders.

- `start_requests()`: must return an iterable of `Request`s (you can return a **list** of requests or write a **generator** function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The `response` parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the `response`, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.
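
The filename logic used in `parse()` above can be tried on its own, without running a crawl. The sketch below copies that logic into a hypothetical helper, `page_filename`, and shows that it depends on the URL ending with a slash.

```python
def page_filename(url: str) -> str:
    """Build the output filename the same way parse() does."""
    # "https://host/page/1/".split("/") ends with [..., "page", "1", ""],
    # so index -2 is the page number only when the URL has a trailing slash.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("https://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("https://quotes.toscrape.com/page/2"))   # quotes-page.html
```

The second call shows why the start URLs in the spider keep their trailing slash.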

## How to run our spider

To put our spider to work, go to the project’s top level directory and run:

```sh
scrapy crawl quotes
```
This command runs the spider named `quotes` that we’ve just added, which will send some requests to the `quotes.toscrape.com` domain. You will get output similar to this:

```
2024-02-05 01:57:09 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)
2024-02-05 01:57:09 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
2024-02-05 01:57:09 [scrapy.addons] INFO: Enabled addons:
[]
2024-02-05 01:57:09 [asyncio] DEBUG: Using selector: EpollSelector
2024-02-05 01:57:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-05 01:57:09 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-02-05 01:57:09 [scrapy.extensions.telnet] INFO: Telnet Password: b9a7a03404bf04d7
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-02-05 01:57:09 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-05 01:57:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-05 01:57:09 [scrapy.core.engine] INFO: Spider opened
2024-02-05 01:57:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-02-05 01:57:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-05 01:57:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2024-02-05 01:57:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2024-02-05 01:57:10 [quotes] DEBUG: Saved file quotes-1.html
2024-02-05 01:57:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2024-02-05 01:57:10 [quotes] DEBUG: Saved file quotes-2.html
2024-02-05 01:57:10 [scrapy.core.engine] INFO: Closing spider (finished)
2024-02-05 01:57:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 684,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 25556,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.235283,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 2, 4, 20, 57, 10, 482312, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 8,
 'log_count/INFO': 10,
 'memusage/max': 65585152,
 'memusage/startup': 65585152,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 2, 4, 20, 57, 9, 247029, tzinfo=datetime.timezone.utc)}
2024-02-05 01:57:10 [scrapy.core.engine] INFO: Spider closed (finished)
```

Now, check the files in the current directory. You should notice that two new files have been created, `quotes-1.html` and `quotes-2.html`, with the content for the respective URLs, as our `parse` method instructs:

```ipython
In [10]: ls ./tutorial
quotes-1.html  quotes-2.html  scrapy.cfg  tutorial
```

```ipython
In [12]: cat ./tutorial/quotes-1.html
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>

                    <a href="/login">Login</a>

                </p>
            </div>
        </div>


<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr
```

## What just happened under the hood?

Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the Spider. Upon receiving a `response` for each one, it instantiates `Response` objects and calls the callback method associated with the `request` (in this case, the `parse` method) passing the response as argument.

## A shortcut to the `start_requests` method

Instead of implementing a `start_requests()` method that generates `scrapy.Request` objects from URLs, you can just define a `start_urls` class attribute with a list of URLs. This list will then be used by the default implementation of `start_requests()` to create the initial requests for your spider.

```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
```

The `parse()` method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because `parse()` is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
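
The flow described in the last two sections (initial requests, a response for each, and `parse()` as the fallback callback) can be modelled with a toy, sequential engine. This is only a sketch for intuition: `ToyRequest`, `ToyResponse`, `ToySpider`, `fake_download` and `run` are hypothetical names, the URLs are placeholders, and the real Scrapy engine is asynchronous and considerably more involved.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyRequest:
    url: str
    callback: Optional[Callable] = None  # None means "use the default callback"


@dataclass
class ToyResponse:
    url: str
    body: bytes


class ToySpider:
    start_urls = [
        "https://example.com/page/1/",
        "https://example.com/page/2/",
    ]

    def start_requests(self):
        # Behaviour being modelled: one request per start URL,
        # with no explicit callback attached.
        for url in self.start_urls:
            yield ToyRequest(url)

    def parse(self, response):
        yield {"url": response.url, "size": len(response.body)}


def fake_download(request):
    # Stand-in for the network: every URL returns the same tiny page.
    return ToyResponse(request.url, b"<html></html>")


def run(spider, download=fake_download):
    queue = deque(spider.start_requests())  # "scheduled" requests
    items = []
    while queue:
        request = queue.popleft()
        response = download(request)
        callback = request.callback or spider.parse  # fall back to parse()
        for result in callback(response) or ():
            if isinstance(result, ToyRequest):
                queue.append(result)  # follow-up request
            else:
                items.append(result)  # scraped item
    return items


print(run(ToySpider()))
```

Because no request carries a callback, every response ends up in `parse()`, which mirrors the default-callback behaviour described above.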

## Extracting data

The best way to learn how to extract data with Scrapy is to try **selectors** in the Scrapy shell.

> Note: Always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the `&` character) will not work.
>
> On Windows, use double quotes instead: `scrapy shell "https://quotes.toscrape.com/page/1/"`

Run:

```sh
scrapy shell 'https://quotes.toscrape.com/page/1/'
```
```
2024-02-06 14:47:14 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)
2024-02-06 14:47:14 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.2, Platform Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
2024-02-06 14:47:14 [scrapy.addons] INFO: Enabled addons:
[]
2024-02-06 14:47:14 [asyncio] DEBUG: Using selector: EpollSelector
2024-02-06 14:47:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-06 14:47:14 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-02-06 14:47:14 [scrapy.extensions.telnet] INFO: Telnet Password: 53d4e3939b5fb7e7
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2024-02-06 14:47:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-06 14:47:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-06 14:47:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-06 14:47:14 [scrapy.core.engine] INFO: Spider opened
2024-02-06 14:47:15 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2024-02-06 14:47:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f5b3ece0cd0>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7f5b3ffddc10>
[s]   spider     <DefaultSpider 'default' at 0x7f5b3e7fa950>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2024-02-06 14:47:16 [asyncio] DEBUG: Using selector: EpollSelector
```
```ipython
In [1]: 
```

Using the shell, you can try selecting elements using [CSS](https://www.w3.org/TR/selectors) with the `response` object:

```ipython
In [1]: response.css("title")
Out[1]: [<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

In [2]: response.status
Out[2]: 200
```

### `getall()` and `get()`

The result of running `response.css('title')` is a list-like object called **SelectorList**, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

```ipython
In [9]: response.css("title::text").getall()
Out[9]: ['Quotes to Scrape']
```

There are two things to note here: one is that we’ve added `::text` to the CSS query, to mean we want to select only the text elements directly inside the `<title>` element. If we don’t specify `::text`, we get the full title element, including its tags:

```ipython
In [11]: response.css("title").getall()
Out[11]: ['<title>Quotes to Scrape</title>']
```

The other thing is that the result of calling `.getall()` is a _list_: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

```ipython
In [12]: response.css("title::text").get()
Out[12]: 'Quotes to Scrape'
```

As an alternative, you could’ve written:

```ipython
In [16]: response.css("title::text")[0].get()
Out[16]: 'Quotes to Scrape'
```

Accessing an index on a SelectorList instance will raise an `IndexError` exception if there are no results. You might want to use `.get()` directly on the SelectorList instance instead, which returns `None` if there are no results:

```ipython
In [17]: response.css("noelement").get()
In [18]: response.css("noelement")[0].get()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
...
IndexError: list index out of range
```

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
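
The same trade-off can be seen with a plain Python list standing in for the SelectorList. The sketch below is an analogy only: `first_or_none` and `build_item` are hypothetical helpers, and in real spider code `.get()` already plays the role of `first_or_none`.

```python
def first_or_none(results):
    """Return the first result, or None when the list is empty."""
    return results[0] if results else None


def build_item(text_results, author_results):
    # A missing field becomes None instead of aborting the whole item.
    return {
        "text": first_or_none(text_results),
        "author": first_or_none(author_results),
    }


# The author query found nothing, but the text is still kept.
print(build_item(["Some quote"], []))  # {'text': 'Some quote', 'author': None}

try:
    [][0]  # indexing an empty result list fails outright
except IndexError as exc:
    print(f"IndexError: {exc}")  # IndexError: list index out of range
```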

### `re()`

Besides the `getall()` and `get()` methods, you can also use the `re()` method to extract using [regular expressions](https://docs.python.org/3/library/re.html):

```ipython
In [20]: response.css("title::text").re(r".*uot.*")
Out[20]: ['Quotes to Scrape']

In [21]: response.css("title::text").re(r"Q\w+")
Out[21]: ['Quotes']

In [22]: response.css("title::text").re(r"(\w+) to (\w+)")
Out[22]: ['Quotes', 'Scrape']
```

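- `\w` represents any word character: a letter, a digit or an underscore. For ASCII-only text this is equivalent to `[a-zA-Z0-9_]`; with Python's default Unicode matching it also covers letters such as `é`.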
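
Since these patterns are ordinary Python regular expressions, they can be tried with the standard `re` module before being used in a selector. This is a minimal sketch on plain strings; note that `re.findall` returns tuples when a pattern has several groups, so `search(...).groups()` is used here to get a flat list.

```python
import re

title = "Quotes to Scrape"

print(re.findall(r"Q\w+", title))                          # ['Quotes']
print(list(re.search(r"(\w+) to (\w+)", title).groups()))  # ['Quotes', 'Scrape']

# \w is Unicode-aware by default; re.ASCII restricts it to [a-zA-Z0-9_].
name = "André Gide"
print(re.findall(r"\w+", name))            # ['André', 'Gide']
print(re.findall(r"\w+", name, re.ASCII))  # ['Andr', 'Gide']
```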

### `view(response)`

In order to find the proper CSS selectors to use, you might find it useful to open the `response` page from the shell in your web browser using `view(response)`. You can use your browser’s developer tools to inspect the HTML and come up with a selector (see [Using your browser’s Developer Tools for scraping](https://docs.scrapy.org/en/latest/topics/developer-tools.html#topics-developer-tools)).

[Selector Gadget](https://selectorgadget.com/) is also a nice tool for quickly finding CSS selectors for visually selected elements; it works in many browsers.

### `XPath`: a brief intro

See [XPath](../XPath_tutorial.ipynb#XPath).

Besides CSS, Scrapy selectors also support using XPath expressions:

```ipython
In [24]: response.xpath("//title")
Out[24]: [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]

In [25]: response.xpath("//title/text()").get()
Out[25]: 'Quotes to Scrape'
```

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. [You can see that](#Extracting-data) if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you can select things like _the link that contains the text “Next Page”_. This makes XPath very well suited to scraping, and we encourage you to learn it even if you already know how to construct CSS selectors; it will make scraping much easier.

We won’t cover much of XPath here, but you can read more about using [XPath with Scrapy Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors). To learn more about XPath, we recommend this [tutorial to learn XPath through examples](http://zvon.org/comp/r/tut-XPath_1.html), and this tutorial to learn [“how to think in XPath”](http://plasmasturm.org/log/xpath101/).
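
Content-based selection of the kind mentioned above can be tried with the standard library alone. The sketch below uses `xml.etree.ElementTree`, which implements only a subset of XPath, on a made-up, well-formed pager fragment. In a Scrapy shell the comparable query would be along the lines of `response.xpath('//a[contains(text(), "Next Page")]')`.

```python
import xml.etree.ElementTree as ET

# A small, well-formed stand-in for a pager fragment (hypothetical markup).
snippet = """
<ul class="pager">
    <li><a href="/page/1/">Previous Page</a></li>
    <li><a href="/page/3/">Next Page</a></li>
</ul>
""".strip()

root = ET.fromstring(snippet)

# [.='Next Page'] keeps only <a> elements whose text is exactly "Next Page".
next_links = root.findall(".//a[.='Next Page']")
print([a.get("href") for a in next_links])  # ['/page/3/']
```

A CSS selector can narrow the match by tag, class or attribute, but not by the link text itself, which is the extra power the paragraph above refers to.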

### Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in `https://quotes.toscrape.com` is represented by HTML elements that look like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```

Let’s open up `scrapy shell` and play a bit to find out how to extract the data we want. We get a list of selectors for the quote HTML elements with:

```sh
scrapy shell 'https://quotes.toscrape.com'
```
```ipython
In [1]: response.css("div.quote")
Out[1]: 
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]
```

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

```ipython
In [3]: quote = response.css("div.quote")[0]
```

Now, let’s extract `text`, `author` and the `tags` from that quote using the `quote` object we just created:

```ipython
In [4]: text = quote.css("span.text::text").get()
In [5]: text
Out[5]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
```

Given that the `tags` are a list of strings, we can use the `.getall()` method to get all of them:

```ipython
In [6]: tags = quote.css("div.tags a.tag::text").getall()
In [7]: tags
Out[7]: ['change', 'deep-thoughts', 'thinking', 'world']
```

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

```ipython
In [8]: for quote in response.css("div.quote"):
   ...:     text = quote.css("span.text::text").get()
   ...:     author = quote.css("small.author::text").get()
   ...:     tags = quote.css("div.tags a.tag::text").getall()
   ...:     print(dict(text=text, author=author, tags=tags))
   ...: 
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
```

### Extracting data in our spider

Let’s get back to our spider. So far, it hasn’t extracted any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the `yield` Python keyword in the `callback`, as you can see below:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
```
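
Using `yield` turns the callback into a generator, so it can hand back any number of items one at a time. The toy function below shows that behaviour in plain Python. The data is made up, and this `parse` takes a list of tuples rather than a response.

```python
def parse(quotes):
    """Toy callback: yield one dict per (text, author) pair."""
    for text, author in quotes:
        yield {"text": text, "author": author}


rows = [("Quote A", "Author A"), ("Quote B", "Author B")]  # made-up data
results = parse(rows)

print(type(results).__name__)  # generator
print(next(results))           # {'text': 'Quote A', 'author': 'Author A'}
print(list(results))           # [{'text': 'Quote B', 'author': 'Author B'}]
```

Nothing runs until the generator is iterated, so whatever calls the callback can consume the items as they are produced.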

To run this spider, exit the Scrapy shell with `quit()` and then run the crawler from the project's top-level directory:

```sh
scrapy crawl quotes
```
```
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2024-02-06 23:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2024-02-06 23:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
2024-02-06 23:53:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2024-02-06 23:53:53 [scrapy.core.engine] INFO: Closing spider (finished)
```

## Storing the scraped data

The simplest way to store the scraped data is by using [Feed exports](#2.8-Feed-exports), with the following command:

```sh
scrapy crawl quotes -O quotes.json
```

That will generate a `quotes.json` file containing all scraped items, serialized in JSON.

The `-O` command-line switch overwrites any existing file; use `-o` instead to append new content to any existing file. However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as `JSON Lines`:

```sh
scrapy crawl quotes -o quotes.jsonl
```

The [JSON Lines format](http://jsonlines.org/) is useful because it’s stream-like: you can easily append new records to it, so it doesn’t have the problem that JSON has when you run the crawl twice. Also, as each record is on a separate line, you can process big files without having to fit everything in memory; tools like `jq` help do that at the command line.
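
The difference between the two formats can be checked with the standard library alone. The sketch below does not run Scrapy: it imitates two appending runs with made-up records (`run_1` and `run_2` are illustrative, not real crawl output) and uses in-memory strings instead of files. The helper `iter_jsonl` is a hypothetical name introduced only for this example.

```python
import json

# Hypothetical records standing in for the items of two separate crawls.
run_1 = [{"author": "Albert Einstein", "tags": ["change"]}]
run_2 = [{"author": "Jane Austen", "tags": ["books"]}]

# Appending a second JSON array after the first one gives "[...]\n[...]",
# which is not a single valid JSON document.
appended_json = json.dumps(run_1) + "\n" + json.dumps(run_2)
try:
    json.loads(appended_json)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# JSON Lines stores one object per line, so appending just adds more lines.
appended_jsonl = "".join(json.dumps(item) + "\n" for item in run_1 + run_2)


def iter_jsonl(text):
    """Yield one decoded record per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


authors = [record["author"] for record in iter_jsonl(appended_jsonl)]
print(json_is_valid)
print(authors)
```

With a real `quotes.jsonl` file, the same loop can iterate over the open file object line by line, so only one record needs to be held in memory at a time.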

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an [Item Pipeline](#2.7-Item-Pipeline). A placeholder file for item pipelines is set up for you when the project is created, in `tutorial/pipelines.py`, though you don’t need to implement any item pipelines if you just want to store the scraped items.
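
As a rough idea of what such a pipeline can look like, here is a minimal sketch assuming the `process_item(self, item, spider)` method used by the Scrapy 2.11 release shown earlier. The class name `CleanQuoteTextPipeline` and the cleaning rule are hypothetical, chosen only to match the quote items of this tutorial. The class is plain Python, so it can be tried without running a crawl.

```python
class CleanQuoteTextPipeline:
    """Hypothetical pipeline: strip the curly quotes around each quote's text."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() once per scraped item; the returned
        # item is handed on to any later pipeline and to the feed export.
        text = item.get("text")
        if text:
            item["text"] = text.strip("“”")
        return item


# Quick check outside Scrapy, with a hand-written item and no real spider.
sample = {
    "text": "“A day without sunshine is like, you know, night.”",
    "author": "Steve Martin",
}
cleaned = CleanQuoteTextPipeline().process_item(sample, spider=None)
print(cleaned["text"])
```

To have Scrapy use a pipeline like this, it would also need to be listed in the `ITEM_PIPELINES` setting in `tutorial/settings.py`, for example `ITEM_PIPELINES = {"tutorial.pipelines.CleanQuoteTextPipeline": 300}`.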

```sh
$ scrapy crawl -h
```
```
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
  -h, --help            show this help message and exit
  -a NAME=VALUE         set spider argument (may be repeated)
  -o FILE, --output FILE
                        append scraped items to the end of FILE (use - for stdout), to define format set a colon at the end of the output
                        URI (i.e. -o FILE:FORMAT)
  -O FILE, --overwrite-output FILE
                        dump scraped items into FILE, overwriting any existing file, to define format set a colon at the end of the
                        output URI (i.e. -O FILE:FORMAT)
  -t FORMAT, --output-format FORMAT
                        format to use for dumping items

Global Options
--------------
  --logfile FILE        log file. if omitted stderr will be used
  -L LEVEL, --loglevel LEVEL
                        log level (default: DEBUG)
  --nolog               disable logging completely
  --profile FILE        write python cProfile stats to FILE
  --pidfile FILE        write process ID to FILE
  -s NAME=VALUE, --set NAME=VALUE
                        set/override setting (may be repeated)
  --pdb                 enable pdb on failure
```

## Following links

Let’s say that, instead of just scraping the first two pages of https://quotes.toscrape.com, you want quotes from all the pages on the website.

Now that you know how to extract data from pages, let’s see how to follow links from them.

The first step is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

```html
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>
```

We can try extracting it in the shell:

```ipython
In [1]: response.css("li.next a")
Out[1]: [<Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' next ')]/descendant-or-self::*/a" data='<a href="/page/2/">Next <span aria-hi...'>]

In [2]: response.css("li.next a").get()
Out[2]: '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
```

This gets the anchor element, but we want the attribute `href`. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

```ipython
In [3]: response.css("li.next a::attr(href)").get()
Out[3]: '/page/2/'
```

There is also an `attrib` property available (see [Selecting element attributes](#Selecting-element-attributes) for more):

```ipython
In [4]: response.css("li.next a").attrib["href"]
Out[4]: '/page/2/'
```

Here is our spider, modified to recursively follow the link to the next page and extract data from it:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

Now, after extracting the data, the `parse()` method looks for the link to the next page, builds a full absolute URL using the `urljoin()` method (since the links can be relative), and `yield`s a new request to the next page, registering itself as the callback to handle the data extraction for the next page and to keep the crawl going through all the pages.
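
The URL resolution step can be tried without a crawl. The sketch below uses the standard library’s `urllib.parse.urljoin` as a stand-in, on the assumption that `response.urljoin()` resolves links against the URL of the current response in the same way for these pages. The variable names are illustrative.

```python
from urllib.parse import urljoin

# URL of the page being parsed (what response.url would hold).
page_url = "https://quotes.toscrape.com/page/1/"

# A root-relative href, as found in the "next" link of the pager.
next_url = urljoin(page_url, "/page/2/")

# An href that is already absolute is returned unchanged.
absolute_url = urljoin(page_url, "https://quotes.toscrape.com/page/3/")

print(next_url)
print(absolute_url)
```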

What you see here is Scrapy’s mechanism of following links: when you `yield` a `Request` in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with **pagination**.
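
The loop described above can be imitated in a few lines of plain Python. This is a toy model under stated assumptions, not Scrapy’s engine: `FAKE_SITE`, `ToyRequest` and `crawl` are made-up names, nothing is downloaded, and the callback receives a URL instead of a response object. It only shows how a callback that yields both items and requests keeps the crawl going until no next page is found.

```python
from collections import deque

# A fake three-page site: URL -> (quotes on the page, href of the next page).
FAKE_SITE = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], "/page/3/"),
    "/page/3/": (["quote D"], None),
}


class ToyRequest:
    """Stand-in for a request: a URL plus the callback to run on its result."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url):
    """Callback: yield one dict per quote, then a request for the next page."""
    quotes, next_page = FAKE_SITE[url]
    for text in quotes:
        yield {"text": text, "page": url}
    if next_page is not None:
        yield ToyRequest(next_page, callback=parse)


def crawl(start_url):
    """Run callbacks until no scheduled requests remain; return the items."""
    pending = deque([ToyRequest(start_url, callback=parse)])
    items = []
    while pending:
        request = pending.popleft()
        # "Downloading" is skipped; the callback gets the URL directly.
        for result in request.callback(request.url):
            if isinstance(result, ToyRequest):
                pending.append(result)  # schedule the follow-up request
            else:
                items.append(result)  # collect the scraped item
    return items


items = crawl("/page/1/")
print([item["text"] for item in items])
```

The crawl stops on its own at the last page because `parse` yields no further request there, which mirrors the `if next_page is not None` check in the spider.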

### A shortcut for creating Requests

As a shortcut for creating `Request` objects you can use `response.follow`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Unlike `scrapy.Request`, `response.follow` supports relative URLs directly - no need to call `urljoin`. Note that `response.follow` just returns a `Request` instance; you still have to `yield` this `Request`.

You can also pass a selector to `response.follow` instead of a string; the selector should extract the necessary attribute:

```python
for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)
```

For `<a>` elements there is a shortcut: `response.follow` uses their `href` attribute automatically. So the code can be shortened further:

```python
for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)
```

To create multiple requests from an iterable, you can use `response.follow_all` instead:

```python
anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)
```

or, shortening it further:

```python
yield from response.follow_all(css="ul.pager a", callback=self.parse)
```

# Basic concepts

# 2.3 Selectors

## Selecting element attributes

# 2.7 Item Pipeline

# 2.8 Feed exports