In [None]:
%%HTML
<!-- execute this cell before continuing -->
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato">
<style>.reveal * { font-family: "Lato" !important; } .reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 { font-family: "Lato" !important; } .reveal .code_cell *, .reveal code, .reveal code * { font-family: monospace !important; }</style>

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    Web Scraping in Python
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:43%; left:10%;">
    David Mertz, Ph.D.
</h3>
</div>

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    mertz@kdm.training
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    @mertz_david
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz
</p>

</div>

<br><br><br>

<h2 style="font-weight: bold;">
    Law, Ethics, and Robots
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

When scraping websites, you should carefully consider the possibility that copyright, publicity rights, or database rights might apply to the content you scrape.  The specific limitations that apply vary greatly by jurisdiction, and INE cannot provide legal advice on these matters.  Nonetheless, in a general sense, the right you have to read a web page for your own personal use may not extend to a right to republish content from that page, nor even to publish or use aggregations or summaries of site content (such as statistics about it, or numeric values extracted from it).

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Copyright rules are similar across the world, since nearly all nations are signatories of the Berne Convention.  Publicity rights are more strongly governed in the European Union than in most places.  So-called database rights are excluded in Australia, and in the United States by the 1991 Feist Publications v. Rural Telephone Service Supreme Court case.  But this area becomes increasingly complicated when web servers and scraping robots are located across varying jurisdictions.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Over and above the rules created by laws and treaties, the Robots Exclusion Protocol (long an informal convention, and published by the IETF as RFC 9309 in 2022) creates a reasonable ethical standard for identifying uses of websites that should not be automated, and limitations that automated access should follow.  For example, a web server may indicate that certain URLs should not be indexed, or that web crawlers should not put an undue burden on a site by too frequent access.  The requests may apply either to specific robots or to all automated programs, in a formally specified manner.  Whether your particular web scraping program constitutes a robot or web crawler under the website's intentions is an additional judgement you need to make; it will be driven by the specifics of your program and of the site.

In this lesson we discuss specific technical means to process the robots.txt file and the other exclusion mechanisms.

<h2 style="font-weight: bold;">
    Three robot languages
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

There are actually three separate mechanisms by which websites might inform users of their intentions about use of the site by automated processes.  On top of those three technical mechanisms, if a particular site requires registration to interact with its content, it probably publishes terms of service that impose some restrictions.

In evaluating what to do with your robots, think about the relationship between your purpose and the guidance the site provides.  As a general rule, creating inadvertent *denial of service attacks* on websites is very bad manners.  It is unfortunately easy, through small programming errors, to let a process run without appropriate limits and thereby put undue load on a server.  Even sites that are happy to be spidered or crawled want that to happen on a reasonable schedule (many will explicitly block you if they detect bad behavior).
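
One way to guard against a runaway process is to cap the work a crawl may do.  The following is only a minimal sketch: `bounded_crawl`, `get_links`, and `max_pages` are hypothetical names, and `get_links` stands in for whatever code fetches a page and extracts its links.

```python
from collections import deque


def bounded_crawl(start, get_links, max_pages=50):
    """Breadth-first crawl that never visits more than max_pages URLs.

    get_links is a hypothetical callable: URL -> iterable of linked URLs.
    """
    seen = {start}            # URLs already queued, so cycles cannot loop
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "site" with a cycle, used instead of real network access
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
print(bounded_crawl("a", lambda u: graph.get(u, [])))
print(bounded_crawl("a", lambda u: graph.get(u, []), max_pages=2))
```

The `seen` set prevents revisiting pages that link to each other, and the page cap bounds the damage if some other part of the logic is wrong.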

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

The three places to look for site guidance in crawling or scraping a site are:

* Link nofollow directive
* META robots tag
* Robots.txt file

The last of those is the most widely used and most often honored.

<h2 style="font-weight: bold;">
    Link nofollow
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Within HTML links—i.e. `<a>` tags—as well as an `href` attribute that specifies where the link points, you commonly encounter a `rel` attribute that describes how the link is related to the current page.  The *relationship* described may have a single value, or it may contain multiple values separated by spaces or commas.

Common relationship semantics that may affect your web crawling purposes, but that do not reflect any specific request by the creator of the site, include `author`, `bookmark`, `external`, `help`, `license`, `login`, `logout`, `next`, `prev`, and `search`.  These words are not enforced, and particular websites might use entirely different words.  However, the words are generally meaningful and descriptive, and may help your robot navigate to relevant linked pages.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

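Two link relationship terms you might see are `noreferrer` and `noopener`.  These are unlikely to affect your web crawler; they instruct a web browser not to reveal to the destination site where an incoming link came from (`noreferrer`), or not to give the opened page access to the window that opened it (`noopener`).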

Of relevance in designing your web crawler are `nofollow`, `ugc`, and `sponsored`.  The last two are largely Google conventions for "user-generated content" and "sponsored (advertising) link."  Google being what it is, many sites decide to add this information.

The key one for you to pay attention to is `nofollow`, which is a *request* by the website creator for a robot (spider/web crawler) not to follow a given link automatically.  Of course, having that value on one specific link does not mean that the same URL cannot occur elsewhere without it.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Let us read a page I created for examples.  Beautiful Soup will return the various `rel` values as a list if they are space separated, but does not do so automatically for commas. Spaces are the more common convention by site creators, but not universal.

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://kdm.training/link-relations.html"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
for a in soup.find_all('a'):
    rel = a.get('rel')  # None, or a list split on whitespace
    if rel and len(rel) == 1:  # A lone value may be comma-separated
        rel = rel[0].split(',')
    print(a.text, rel)

Lorem ipsum None
Advertising ['sponsored']
User-generated content ['ugc']
Nofollow links ['nofollow']
Mulitplicity in philosophy ['ugc', 'external', 'author']
Input device sharing ['external', 'nofollow', 'sponsored']


<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

It is your decision how to treat these link relationships.  It is easy to forget to check for the `rel` attribute of the `a` tag at all, and most working code probably does forget.  Very often that attribute does not exist, in any case.

Assuming you are trying to use the guidance, you might write code similar to the following.  It is simple Beautiful Soup code, but the same concept would apply with Scrapy, or Selenium, or any other library.

In [2]:
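def links_to_follow(url, exclude=frozenset({"nofollow", "sponsored"})):
    """Return hrefs on the page whose rel values are not excluded."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    urls = []
    for a in soup.find_all('a', href=True):  # skip anchors lacking href
        rel = a.get('rel') or []
        # A lone value may itself be comma-separated
        rel = rel[0].split(',') if len(rel) == 1 else rel
        if exclude & set(rel):
            # Do not spider these types of relations
            continue
        urls.append(a['href'])
    return urls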

In [3]:
links_to_follow(url)

['https://en.wikipedia.org/wiki/Lorem_ipsum',
 'https://en.wikipedia.org/wiki/User-generated_content',
 'https://en.wikipedia.org/wiki/Multiplicity_(philosophy)']
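
The cells above split on commas only when `rel` holds a single value, so a mixed value such as `external, nofollow` would slip through.  A slightly more defensive normalizer might look like the sketch below; `rel_tokens` is a hypothetical helper name, and it assumes its input is whatever `a.get('rel')` returned (a list, a string, or `None`).

```python
import re


def rel_tokens(rel):
    """Normalize a rel value into a set of lowercase tokens.

    Accepts None, a single string, or a list of strings, and splits
    every piece on commas and/or whitespace.
    """
    if rel is None:
        return set()
    if isinstance(rel, str):
        rel = [rel]
    tokens = set()
    for part in rel:
        tokens.update(t.lower() for t in re.split(r"[,\s]+", part) if t)
    return tokens


print(sorted(rel_tokens(["ugc,external,author"])))
print(sorted(rel_tokens(["external,", "NoFollow", "sponsored"])))
print(rel_tokens(None))
```

With this helper, the exclusion test becomes `exclude & rel_tokens(a.get('rel'))`, regardless of which separator convention the site uses.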

<h2 style="font-weight: bold;">
    META robots
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Within the header of a web page, one or more `<meta>` tags might occur.  These can serve many purposes, but one is to instruct robots on good behavior.  Each `<meta>` tag is distinguished by having a different `name` attribute indicating its purpose.

Historically, the HTML `<meta>` tag was spelled as `<META>` instead.  Usually in such older web pages, the attributes and values are likewise in uppercase.  While that uppercase convention is fairly old, it remains fairly common in published web pages.  Beautiful Soup is aware of this issue, and treats the `<META>` tag as if it were lowercase, and also canonicalizes the `name` and `content` attributes.  However, the attribute values are not modified.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Let us take a look at the same sample page as above.

In [4]:
print(page[:213])

<html>
<head>
<title>Tagged links</title>
<meta name="googlebot" content="noindex, follow, nocache">
<meta name="robots" content="noindex, nofollow">
<META NAME="GEEZERBOT" CONTENT="NOINDEX, NOIMAGEINDEX">
</head>


In [5]:
metas = soup.find_all('meta')
metas

[<meta content="noindex, follow, nocache" name="googlebot"/>,
 <meta content="noindex, nofollow" name="robots"/>,
 <meta content="NOINDEX, NOIMAGEINDEX" name="GEEZERBOT"/>]

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

It is easy enough to canonicalize the content to lowercase.  No space splitting or listifying is done for the `content` attribute, because it is not one of the attributes (such as `rel` or `class`) that Beautiful Soup treats as multi-valued.

In [6]:
for meta in metas:
    print(meta['name'], "  \t", meta['content'].lower())

googlebot   	 noindex, follow, nocache
robots   	 noindex, nofollow
GEEZERBOT   	 noindex, noimageindex


<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Your particular web scraper should probably follow the advice for the generic "robots" name.  If you are Google, or Yandex, or Baidu, you should follow the more specific advice directed at your robot.  

Some of the rest takes modest interpretation.  Indexing and caching (including separate image indexing) are obviously things general search engines do.  Your particular web crawler may or may not index pages (either the one at hand or ones followed).  There are definitely gray areas about what does or does not amount to indexing, though (e.g. is a list of "pages where the name of my company appears" an index?).
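
The choice between the generic and the robot-specific advice can be sketched as below.  This is only an illustration: `robots_directives` is a hypothetical helper, and it assumes you have already collected `(name, content)` pairs, for example with `[(m['name'], m['content']) for m in metas]`.

```python
def robots_directives(metas, agent="robots"):
    """Return the set of lowercase directives that apply to agent.

    metas is an iterable of (name, content) pairs.  Advice addressed
    to the agent by name wins; otherwise the generic "robots" advice
    is used; with neither present, the result is an empty set.
    """
    by_name = {}
    for name, content in metas:
        directives = {d.strip().lower() for d in content.split(",") if d.strip()}
        by_name.setdefault(name.lower(), set()).update(directives)
    return by_name.get(agent.lower(), by_name.get("robots", set()))


# The three <meta> tags from the sample page in this lesson
sample = [
    ("googlebot", "noindex, follow, nocache"),
    ("robots", "noindex, nofollow"),
    ("GEEZERBOT", "NOINDEX, NOIMAGEINDEX"),
]
mine = robots_directives(sample, "MyRobot")   # falls back to "robots"
print(sorted(mine))
print("may follow links:", "nofollow" not in mine)
print(sorted(robots_directives(sample, "geezerbot")))
```

Matching names case-insensitively is a design choice made here so that older uppercase pages are handled the same way as newer ones.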

<h2 style="font-weight: bold;">
    Parsing robots.txt
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

For the most part, requests from site creators to consumers about automated use of their sites live in the special file `robots.txt`.  A single file at the root of a given domain describes which access patterns are permitted, and which are not, for robots (i.e. web scrapers, spiders, automated access).

This file consists of multiple sections, each pertaining to one or more robots.  The instructions can describe which paths may be accessed at all, and also how fast a robot may access the site.  Let us look at a simple hypothetical example.

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

```
User-agent: nicebot
User-agent: fancybot
Crawl-delay: 2
# Can index archive, but not MOST 'transient' content
Allow: /archive/
Disallow: /transient/
Allow: /transient/special
```
```
User-agent: *
Crawl-delay: 10
# Other bots cannot index archive, nor query URLs
Disallow: /archive/
Disallow: /*?*
Disallow: *.php$
```

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

The pattern of what is permitted and what is not, and what might override what, can be moderately complicated to interpret.  The best way to make sure you are drawing the right conclusion is to use `urllib.robotparser` from the standard library.  Let us start with a simple example (the University of California might change this file later; this is how it appears now).

In [7]:
from urllib import robotparser
import requests
berkeley = 'https://www.berkeley.edu/robots.txt'
print(requests.get(berkeley).text)

User-agent: *
Disallow: /directory/
Crawl-delay: 120



<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Let us check a few permissions for accessing Berkeley resources.

In [8]:
parser = robotparser.RobotFileParser()
parser.set_url(berkeley)
parser.read()
print("Crawl delay for specific agent:", parser.crawl_delay('bad_robot'))
print("Generic crawl delay for anyone:", parser.crawl_delay('*'), "\n")

get_map = parser.can_fetch("MyRobot", "https://www.berkeley.edu/map/")
get_dir = parser.can_fetch("MyRobot", "https://www.berkeley.edu/directory/")
print("I may crawl the map:  ", get_map)
print("I may crawl directory:", get_dir)

Crawl delay for specific agent: 120
Generic crawl delay for anyone: 120 

I may crawl the map:   True
I may crawl directory: False
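
The parser can also be fed rules directly with `.parse()`, which makes it easy to experiment without any network access.  As a sketch, here is the hypothetical `robots.txt` from earlier in this lesson; `example.com` is a placeholder domain and `otherbot` is a made-up robot name.

```python
from urllib import robotparser

# The hypothetical robots.txt shown earlier in this lesson
ROBOTS_TXT = """\
User-agent: nicebot
User-agent: fancybot
Crawl-delay: 2
# Can index archive, but not MOST 'transient' content
Allow: /archive/
Disallow: /transient/
Allow: /transient/special

User-agent: *
Crawl-delay: 10
# Other bots cannot index archive, nor query URLs
Disallow: /archive/
Disallow: /*?*
Disallow: *.php$
"""

demo = robotparser.RobotFileParser()
demo.parse(ROBOTS_TXT.splitlines())  # parse text already in memory

base = "https://example.com"  # placeholder domain for illustration
paths = ("/archive/2020.html",
         "/transient/today.html",
         "/transient/special/offer.html")
for agent in ("nicebot", "otherbot"):
    print(agent, "crawl delay:", demo.crawl_delay(agent))
    for path in paths:
        print("   ", path, demo.can_fetch(agent, base + path))
```

Two caveats are worth checking against your own Python version.  In the CPython releases available at the time of writing, this parser appears to answer with the first rule line that matches, in file order, so the `/transient/special` exception listed after `Disallow: /transient/` may not take effect; it also does not appear to expand the `*` and `$` wildcards in paths.  When a file relies on either feature, treat the parser's answers conservatively.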


<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Berkeley was very generic in their rules. Many sites are more specific.  Let us look at Project Gutenberg.

In [9]:
pg = "https://www.gutenberg.org/robots.txt"
rp = robotparser.RobotFileParser()
rp.set_url(pg)
rp.read()
print(str(rp.default_entry)[:300], '...')

User-agent: *
Disallow: /etext
Disallow: /dirs/etext
Disallow: /dirs/1
Disallow: /dirs/2
Disallow: /dirs/3
Disallow: /dirs/4
Disallow: /dirs/5
Disallow: /dirs/6
Disallow: /dirs/7
Disallow: /dirs/8
Disallow: /dirs/9
Disallow: /catalog/world/
Disallow: /ebooks/search
Disallow: /ebooks/send/
Disallow:  ...


<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

What things are permitted and which are not?

In [10]:
rp.can_fetch("AdsBot-Google", "https://www.gutenberg.org/help/faq.html")

False

In [11]:
rp.can_fetch("MyRobot", "https://www.gutenberg.org/help/faq.html")

True

In [12]:
rp.can_fetch("MyRobot", "https://www.gutenberg.org/ratelimiter")

False

<h2 style="font-weight: bold;">
    Good manners (Summary)
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The standard for being polite to websites is fairly simple.  Create a `RobotFileParser` before you access additional resources in a particular domain.  Call `.crawl_delay(my_robot_name)` once at the start of crawling, and add delays in obtaining resources matching the request the website makes.  The parser will figure out which rule your robot name belongs to.  

Then for each URL you are considering accessing, ask the `.can_fetch(my_robot_name, url)` question before actually retrieving it.  Making that call does not use any network connection after the initial parser is created; it just analyzes the combined rules, which will not take long.
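
The two steps just described can be combined into one small helper.  The sketch below is one possible arrangement, not a definitive implementation: `PoliteGate`, `wait_turn`, and the one-second `default_delay` are hypothetical names and values, and the rules are passed in as lines so the example runs offline (in real use you would call `.set_url()` and `.read()` instead).

```python
import time
from urllib import robotparser


class PoliteGate:
    """Hypothetical helper: robots.txt permission plus crawl delay."""

    def __init__(self, robots_lines, agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.agent = agent
        self.parser = robotparser.RobotFileParser()
        self.parser.parse(robots_lines)
        delay = self.parser.crawl_delay(agent)
        # Fall back to an assumed default when the site names no delay
        self.delay = default_delay if delay is None else float(delay)
        self._clock, self._sleep = clock, sleep
        self._next_ok = None  # earliest time the next fetch may start

    def wait_turn(self, url):
        """Return False if url is off limits; otherwise pause as needed."""
        if not self.parser.can_fetch(self.agent, url):
            return False
        now = self._clock()
        if self._next_ok is not None and now < self._next_ok:
            self._sleep(self._next_ok - now)
        self._next_ok = self._clock() + self.delay
        return True


# Rules matching the Berkeley file shown earlier in this lesson
rules = ["User-agent: *", "Disallow: /directory/", "Crawl-delay: 120"]
naps = []  # record requested pauses instead of really sleeping
gate = PoliteGate(rules, "MyRobot", clock=lambda: 0.0, sleep=naps.append)
print(gate.wait_turn("https://www.berkeley.edu/map/"))
print(gate.wait_turn("https://www.berkeley.edu/directory/"))
print(gate.wait_turn("https://www.berkeley.edu/news/"))
print(naps)
```

The `clock` and `sleep` parameters exist only so the pauses can be observed without waiting; with the defaults, `wait_turn` really sleeps between permitted fetches.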

<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img></div>

Admittedly, it is not quite as simple if you look at `<meta name="robots" ...>` directives and `<a rel="...">` attributes.  But building all these together is not especially difficult.

When you are considering adding a URL to the list of those you will crawl, check whether you obtained it via a link tag with a prohibited relation, usually "nofollow".  If so, probably do not add it.  When you retrieve a new page that is otherwise permitted by `robots.txt`, check the `<meta>` tag to decide whether you will obtain new links from its contents.

Performing all of these checks is only a few lines of code, mostly the ones shown in this lesson.  Generalizing them to whatever web scraping library you might be using is not difficult.