# Scrapy Overview

Scrapy is an **application framework** for **crawling web sites** and **extracting structured data**, which can be used for a wide range of applications such as data mining, information processing, or historical archival.

Even though Scrapy was originally designed for web scraping, it can also **be used to extract data using APIs** (such as Amazon Associates Web Services) or as a general-purpose web crawler.

- [Documentation](https://docs.scrapy.org/en/latest/intro/overview.html)

## Scrapy at a glance

### Walk-through of an example spider
Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination links:

In [1]:
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that identifies this spider.
    name = "quotes"
    # The crawl starts from these URLs.
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "next" link, if any, and parse that page the same way.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
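
The pagination step relies on `response.follow` accepting a relative URL and resolving it against the page being parsed. The following is a rough standard-library sketch of that kind of resolution, not Scrapy's own code; the `next_href` value is an assumed example of what the `li.next a` link might contain.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (taken from start_urls above).
current_page = "http://quotes.toscrape.com/tag/humor/"

# Hypothetical value of the "next" link's href attribute.
next_href = "/tag/humor/page/2/"

# Joining the relative href with the current page URL gives the absolute
# URL of the next page to request.
next_url = urljoin(current_page, next_href)
print(next_url)
```

When no "next" link exists, `extract_first()` returns `None`, so the spider yields no further request and the crawl ends.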

In [2]:
!scrapy runspider quotes_spider.py -o quotes.json

/bin/sh: 1: scrapy: not found
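
The `scrapy: not found` message above comes from the shell, which could not find a `scrapy` executable on its `PATH`. A small sketch for checking this from Python before running the command, using only the standard library (the helper name `has_command` is made up for illustration):

```python
import shutil


def has_command(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if has_command("scrapy"):
    print("scrapy is available; the runspider command should start")
else:
    print("scrapy is not on PATH; see the Installation section")
```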


## Installation

### All platforms
- `conda install scrapy`
- `pip install Scrapy`

### Dependencies
- `lxml`, an efficient XML and HTML parser
- `parsel`, an HTML/XML data extraction library written on top of lxml
- `w3lib`, a multi-purpose helper for dealing with URLs and web page encodings
- `twisted`, an asynchronous networking framework
- `cryptography` and `pyOpenSSL`, to deal with various network-level security needs

The minimum versions that Scrapy is tested against are:
- `Twisted` 14.0
- `lxml` 3.4
- `pyOpenSSL` 0.14
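
One way to compare an environment against these minimums is sketched below. It assumes Python 3.8 or newer (for `importlib.metadata`), and the helpers `as_tuple` and `meets_minimum` are illustrative names, not part of Scrapy. The comparison is deliberately simple: it looks only at the leading numeric parts of each version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as listed above.
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def as_tuple(version_string):
    """Turn '3.4.1' into (3, 4, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically, part by part."""
    return as_tuple(installed) >= as_tuple(minimum)


for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (needs >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    print(f"{package}: {installed} ({status}, needs >= {minimum})")
```

Comparing integer tuples rather than raw strings matters here: as strings, `"0.9"` sorts after `"0.14"`, while numerically 0.9 is the older release.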

In [3]:
1 + 2

3

In [4]:
l = [1, 2, 4]

In [5]:
l

[1, 2, 4]

In [6]:
l.append(5)
l

[1, 2, 4, 5]

In [18]:
import math

In [20]:
dir(math)

['__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'acos',
 'acosh',
 'asin',
 'asinh',
 'atan',
 'atan2',
 'atanh',
 'ceil',
 'copysign',
 'cos',
 'cosh',
 'degrees',
 'e',
 'erf',
 'erfc',
 'exp',
 'expm1',
 'fabs',
 'factorial',
 'floor',
 'fmod',
 'frexp',
 'fsum',
 'gamma',
 'gcd',
 'hypot',
 'inf',
 'isclose',
 'isfinite',
 'isinf',
 'isnan',
 'ldexp',
 'lgamma',
 'log',
 'log10',
 'log1p',
 'log2',
 'modf',
 'nan',
 'pi',
 'pow',
 'radians',
 'sin',
 'sinh',
 'sqrt',
 'tan',
 'tanh',
 'tau',
 'trunc']

In [22]:
math.cos(math.pi)

-1.0

In [23]:
import requests

ModuleNotFoundError: No module named 'requests'