In [15]:
%%HTML
<!-- execute this cell before continuing -->
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato">
<style>.reveal * { font-family: "Lato" !important; } .reveal .code_cell * { font-family: monospace !important; }</style>

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    Web Scraping in Python
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:43%; left:10%;">
    David Mertz, Ph.D.
</h3>
</div>

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    mertz@kdm.training
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    @mertz_david
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz
</p>

</div>

<br><br><br>

<h2 style="font-weight: bold;">
    Scrapy
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Unlike Beautiful Soup, which is a library for parsing and manipulating HTML files, Scrapy is a full-fledged *spider* (a.k.a. *web crawler* or *robot*) that focuses on retrieving and processing many web pages during a run.  Generally, `scrapy` is used as a command-line tool that runs scripts, and it provides various switches for that use. It can also easily be put into a cron job or other automated scheduling.

Scrapy uses XPath and CSS selectors to identify portions of a scraped web page.  Optionally, you are free to perform the actual parsing using the now-familiar Beautiful Soup library.  Moreover, Scrapy schedules requests for many web pages asynchronously.  Unlike the sequential scraper we wrote in the prior lesson, scripts for Scrapy will concurrently download the resources identified (including any queued as a result of finding links within scraped pages).

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   

To start, let us import a few capabilities, as is common in these courses.

</div>

In [2]:
import re
import json
from bs4 import BeautifulSoup
import scrapy

<h2 style="font-weight: bold;">
    A basic preview
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The first example shown here is based on an early one from Scrapy's documentation.  In particular, it scrapes the test domain `toscrape.com` that is provided for just this kind of practice.


First we can look at the source code of the file defining the scraper, then we can run it from the command line.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'Quotes'
    start_urls = ['http://quotes.toscrape.com/']
```
```python
    custom_settings = {
        "DEPTH_LIMIT": 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
        'ROBOTSTXT_OBEY': True,
        'DOWNLOAD_DELAY': 0.25,
    }
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```python
    def parse(self, response):
        for quote in response.css('div.quote'):
            text = quote.css('span.text::text').get()
            yield {'author': quote.xpath('span/small/text()').get(),
                   'text': text.replace('“', '').replace('”', '')}
  
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

The output to STDERR from the run of the spider is fairly long. Let us save it in a file to look through.  Here we run the `QuoteScrape.py` file shown and send output as JSON Lines to `quotes.jl`.

In [16]:
!rm -f quotes.jl
!scrapy runspider QuoteScrape.py -o quotes.jl -L INFO 2>quotescrape.log

In [4]:
!sed -n '1,16p' quotescrape.log

2020-11-16 15:34:50 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-11-16 15:34:50 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-5.4.0-7634-generic-x86_64-with-glibc2.10
2020-11-16 15:34:50 [scrapy.crawler] INFO: Overridden settings:
{'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
 'DEPTH_LIMIT': 1,
 'DOWNLOAD_DELAY': 0.25,
 'LOG_LEVEL': 'INFO',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_LOADER_WARN_ONLY': True}
2020-11-16 15:34:50 [scrapy.extensions.telnet] INFO: Telnet Password: a86c7570c2ef38c3
2020-11-16 15:34:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedE

In [5]:
!sed -n '17,29p' quotescrape.log

2020-11-16 15:34:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']


In [6]:
!sed -n '30,42p' quotescrape.log

2020-11-16 15:34:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-11-16 15:34:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-11-16 15:34:50 [scrapy.core.engine] INFO: Spider opened
2020-11-16 15:34:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-16 15:34:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-11-16 15:34:51 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-16 15:34:51 [scrapy.extensions.feedexport] INFO: Stored jl feed (20 items) in: quotes.jl


In [7]:
!sed -n '43,99p' quotescrape.log

2020-11-16 15:34:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 724,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 5642,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.929831,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 11, 16, 20, 34, 51, 489919),
 'item_scraped_count': 20,
 'log_count/INFO': 11,
 'memusage/max': 56225792,
 'memusage/startup': 56225792,
 'request_depth_max': 1,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 11, 16, 20, 34, 50, 560088)}
2020-11-16 15:34:51 [scrapy.core.engine] INFO:

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Let us take a quick look to find out what we scraped.  In particular, there are 10 quotes per page at the URL, so limiting the depth of crawling to 1 should fetch only the first 20.  JSON Lines in a local file is a convenient format to use here, but we could configure Scrapy to store results over FTP, AWS S3, or Google Cloud Storage, and in formats like XML, Python pickle, or CSV (or even plug in custom backends like an RDBMS).

In [8]:
with open('quotes.jl') as fh:
    quotes = [json.loads(quote) for quote in fh]

print(f"{len(quotes)} quotes scraped")
print(f"Example: {quotes[6]['author']}")
print(quotes[6]['text'])

20 quotes scraped
Example: André Gide
It is better to be hated for what you are than to be loved for what you are not.


<h2 style="font-weight: bold;">
    Parsing with Beautiful Soup
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

If we wish to use Beautiful Soup to identify portions of the document we scrape, instead of the native XPath and CSS selectors that Scrapy uses, we can easily substitute that portion.  Whether a given query is easier to express as a soup or as XPath/CSS will vary according to task and personal preference; Beautiful Soup is more robust against ill-formed HTML, though.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```python
import scrapy
from bs4 import BeautifulSoup

class QuotesSpider(scrapy.Spider):
    name = 'Quotes'
    start_urls = ['http://quotes.toscrape.com/']
    
    custom_settings = {
        "DEPTH_LIMIT": 1,
    }
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```python
    def parse(self, response):
        # Name the parser explicitly to avoid a warning about guessing it
        soup = BeautifulSoup(response.text, 'lxml')
        for quote in soup.find_all('div', class_='quote'):
            text = quote.find('span', class_='text').string
            yield {'author': text.find_next(class_="author").text,
                   'text': text.replace('“', '').replace('”', '')}
  
        # The last page has no "next" item, so check before indexing
        next_li = soup.find('li', class_='next')
        if next_li is not None:
            yield response.follow(next_li.find('a')['href'], self.parse)
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

If we raise the logging threshold on the `scrapy` command line (here to `ERROR`), there will normally be no output that needs redirecting.

In [9]:
!rm -f quotes2.jl
!scrapy runspider QuoteScrapeBS.py -o quotes2.jl -L ERROR 

In [10]:
with open('quotes2.jl') as fh:
    quotes = [json.loads(quote) for quote in fh]

print(f"{len(quotes)} quotes scraped")
print(f"Example: {quotes[9]['author']}")
print(quotes[9]['text'])

20 quotes scraped
Example: Steve Martin
A day without sunshine is like, you know, night.


<h2 style="font-weight: bold;">
    Logging in and credentials
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Where Scrapy provides a strong advantage over simply using `requests` and Beautiful Soup is where a certain degree of automation is needed.  For example, logging into a web site before performing scraping actions is commonly needed.

It is common for web sites to include `<input type="hidden">` elements within login forms.  These carry session data or authentication tokens that are important to preserve.  Using `scrapy.FormRequest.from_response()` captures all of those while overriding only the form fields needed.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

As an example, we will log in to the same `toscrape.com` domain used in earlier examples.  It provides a "Login" link that must be followed to get to the form.  Any credentials succeed on this site, so the code simulates an occasional failure at random.



```python
import scrapy
from bs4 import BeautifulSoup
from random import choice, random

def authentication_failed(response):
    # Check the contents of the response, True if failed
    soup = BeautifulSoup(response.text, 'lxml')
    has_logout = [a for a in soup.find_all('a') if a.text == 'Logout']
    # randomly fail sometimes
    return not has_logout or random() > 0.75
```

```python
class LoginSpider(scrapy.Spider):
    name = 'Login quotes.toscrape.com'
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        login_link = [a['href'] for a in soup.find_all('a') 
                                if a.text == 'Login'][0]
        return response.follow(login_link, self.login)

    def login(self, response):
        self.user = choice(['user1', 'user2', 'user3', 'user4'])
        return scrapy.FormRequest.from_response(
            response, callback=self.after_login, 
            formdata={'username': self.user, 'password': 'pw'})

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error(f"Login failed for {self.user}")
            return 
        # Get one random quote author
        return response.follow('/random', self.author)
          
    def author(self, response):
        for quote in response.css('div.quote'):
            yield {'author': quote.xpath('span/small/text()').get()}
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>   

We can run the scraper a few times to encounter the failure and success of login on a given attempt.

In [11]:
!rm -f author.jl
!scrapy runspider -o author.jl Login.py -L WARNING
!cat author.jl

{"author": "Mark Twain"}


<h2 style="font-weight: bold;">
    Link extraction
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

It would not be so difficult to locate all the links on a page using XPath or Beautiful Soup, but it is a common enough need that Scrapy automates it in a class.  Here is a small program that will find all the new titles published by Project Gutenberg. Use this in moderation; since we do not actually crawl here, the load on the site will not be high.

In [12]:
import scrapy
from scrapy.linkextractors import LinkExtractor

class PG_NewTitles(scrapy.Spider):
    # A snapshot of the current "new titles" on Project Gutenberg
    name = 'New Titles'
    link_extractor = LinkExtractor()
    start_urls = ['https://dev.gutenberg.org/browse/recent/last1.html.utf8']
    
    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # Many links to general navigation, a heuristic to narrow results
            if 'ebooks' in link.url:
                yield {"title": link.text, "url": link.url}
                # This would recurse into linked pages. Not permitted by PG
                # yield Request(link.url, callback=self.parse)

In [13]:
!rm -f newtitles.jl
!scrapy runspider -o newtitles.jl New_at_PG.py -L WARNING

In [14]:
i = 0
for line in open('newtitles.jl'):
    book = json.loads(line)
    if book['title'].startswith(('Search and', 'Bookshelves', 'Offline')):
        continue
    print(f"{book['title']}")
    print(f"{book['url']}")
    print()
    if (i := i+1) > 5: break

The Christian serving his own generationA Sermon occasioned by the lamented death of Joseph John Gurney, Esq.
https://dev.gutenberg.org/ebooks/63770

Christ Remembered at His Table
https://dev.gutenberg.org/ebooks/63769

Eight Dramas of Calderon
https://dev.gutenberg.org/ebooks/63776

Poetry of the Anti-jacobinComprising the Celebrated Political and Satirical Poems,of the Rt. Hons. G. Canning, John Hookham Frere, W. Pitt,the Marquis Wellesley, G. Ellis, W. Gifford, the Earl ofCarlisle, and Others.
https://dev.gutenberg.org/ebooks/63778

Legend
https://dev.gutenberg.org/ebooks/63775

Amica America
https://dev.gutenberg.org/ebooks/63777



<h2 style="font-weight: bold;">
    Other Scrapy tools
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

When you start to use Scrapy as a large scale web scraper, a variety of tools are available to monitor or analyze the behavior.  For example, we already saw the verbose log files that can be produced (much more, in fact, if you use the `DEBUG` level).

A few other tools are worth mentioning, but will not be detailed here.  For our examples that limited themselves to scraping only a few pages, these are not relevant.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>   
</div>

**Telnet access**.  A long-running spider starts a telnet server through which you can interrogate features of its run.  Log entries show details about the server, for example:

```
2020-11-15 16:44:03 [scrapy.extensions.telnet] INFO: Telnet Password: 1fe8a0b3190dad26
2020-11-15 16:44:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
```

The password will vary each time (and a non-default port will be chosen if the default is occupied).  The settings `TELNETCONSOLE_USERNAME` and `TELNETCONSOLE_PASSWORD` may be defined within the spider (the default user is "scrapy"). The telnet console looks like a regular Python shell, but with some special objects available.

```python
% telnet localhost 6023
Trying localhost...
Connected to localhost.
Escape character is '^]'.
Username: scrapy
Password: <generated_password>

>>> engine.pause()    # pause operation
>>> engine.unpause()  # resume operation
>>> engine.stop()
Connection closed by foreign host.
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Of special interest is the `est()` command provided.  E.g.:

```python
>>> est()
Execution engine status

time()-engine.start_time                        : 8.62972998619
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 16
engine.scraper.is_idle()                        : False
engine.spider.name                              : followall
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 16
len(engine.slot.scheduler.dqs or [])            : 0
len(engine.slot.scheduler.mqs)                  : 92
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 0
engine.scraper.slot.active_size                 : 0
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False
```



<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>  

**Scrapy shell**. Using the shell effectively launches a Scrapy engine with no requests pending.  You can interact with it using commands similar to those in the telnet console, or indeed execute arbitrary Python code to explore Scrapy and websites.

Let us take a look at such an interaction.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```python
[webscraping]% scrapy shell 'https://kdm.training' --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f2fd738b040>
[s]   item       {}
[s]   request    <GET https://kdm.training>
[s]   response   <200 https://www.gnosis.cx/kdm/>
[s]   settings   <scrapy.settings.Settings object at 0x7f2fd7388df0>
[s]   spider     <DefaultSpider 'default' at 0x7f2fd6bf9040>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```python
In [1]: crypt = response.xpath('//*[@class="__cf_email__"]/@data-cfemail').get()

In [2]: crypt
Out[2]: 'f39a9d959cb398979edd8781929a9d9a9d94'
    
In [3]: def decode(email):
   ...:     # Cloudflare obscures email addresses to make harvesting harder
   ...:     plaintext = ""
   ...:     k = int(email[:2], 16)
   ...:     for i in range(2, len(email)-1, 2):
   ...:         plaintext += chr(int(email[i:i+2], 16)^k)
   ...:     return plaintext
   ...:

In [4]: decode(crypt)
Out[4]: 'info@kdm.training'
```
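Outside the shell, the same decoding can be written as a standalone function using only the standard library. This is a sketch of the scheme the session above relies on, where the first byte is an XOR key for the rest; the function names and the matching encoder are additions for illustration, and the encoder assumes an ASCII address.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a hex string whose first byte is an XOR key for the rest."""
    data = bytes.fromhex(encoded)
    key, payload = data[0], data[1:]
    return ''.join(chr(b ^ key) for b in payload)

def encode_cfemail(address: str, key: int = 0xf3) -> str:
    """Inverse of decode_cfemail(), for an ASCII address (illustrative)."""
    payload = bytes(b ^ key for b in address.encode('ascii'))
    return (bytes([key]) + payload).hex()

crypt = 'f39a9d959cb398979edd8781929a9d9a9d94'
print(decode_cfemail(crypt))   # info@kdm.training
print(encode_cfemail('info@kdm.training') == crypt)
```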

<h2 style="font-weight: bold;">
    Summary
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

We have seen both of the query languages (XPath and CSS) that Scrapy provides.  More essential than that, we looked at Scrapy as an *engine* for powering high-performance web crawlers.  Scrapy is able to address form completion, password management, state and cookie handling, and a variety of other needs of robust robots.

In the next lesson, we look at the Selenium framework.  It overlaps greatly in capabilities with Scrapy, but utilizes actual web browsers to automate scraping.