## Scraping

There is a lot of great data out on the web. Unfortunately, not all of it is readily available via APIs, and even when an API is available, it may restrict the data we have access to. Scraping usually refers to extracting content from web pages when an API is not available.

In the API section, we used urllib to call an API and save data. We can also use it to aid in our extraction of data from webpages.

In [1]:
import urllib.request

In [2]:
html = urllib.request.urlopen("http://xkcd.com/1481/")
print(html.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<script>\n  (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n  })(window,document,\'script\',\'https://www.google-analytics.com/analytics.js\',\'ga\');\n\n  ga(\'create\', \'UA-25700708-7\', \'auto\');\n  ga(\'send\', \'pageview\');\n</script>\n<link rel="stylesheet" type="text/css" href="/s/b0dcca.css" title="Default"/>\n<title>xkcd: API</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<script type="text/javascript" src="/s/b66ed7.js" async></scr
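
The `b'...'` prefix in the output above means `read()` returned bytes rather than text. A minimal sketch of turning those bytes into a string, using a shortened stand-in for the response and assuming the page is UTF-8 encoded:

```python
# Shortened stand-in for the bytes returned by html.read(); not the full page.
raw = b"<!DOCTYPE html>\n<html>\n<head>\n<title>xkcd: API</title>\n</head>\n</html>"

# Assumption: the page is UTF-8 encoded. Real code could check the
# response's Content-Type header instead of hard-coding the encoding.
text = raw.decode("utf-8")

print(type(raw).__name__, "->", type(text).__name__)
print(text)
```

Once decoded, the escaped `\n` sequences print as real line breaks, which makes the markup much easier to read.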

We can use the `urlretrieve` function to retrieve a specific resource, such as a file, via its URL. This is basic web scraping.

If we look through our html above, we can see there is a url for the image in the page. (Look for: ```Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/api.png```)
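
Rather than reading through the page by eye, we could search the decoded text for that line. This is only a sketch: `page_text` is a stand-in containing the line quoted above, and the regular expression assumes the label text stays exactly as shown.

```python
import re

# Stand-in for the decoded page; only the "Image URL" line comes from the real page.
page_text = (
    "...\n"
    "Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/api.png\n"
    "...\n"
)

# Capture the run of non-whitespace characters that follows the label.
match = re.search(r"Image URL \(for hotlinking/embedding\):\s*(\S+)", page_text)
image_url = match.group(1) if match else None
print(image_url)
```

String searches like this are brittle: if the site changes the label, `match` will be `None`, which is why the code checks for it.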

But before we go doing that, maybe we should check the robots.txt file first...

In [3]:
robot = urllib.request.urlopen("https://xkcd.com/robots.txt")
print(robot.read())

b'User-agent: *\nDisallow: /personal/'


Looks like we are good!
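
We can also let Python interpret the rules for us with the standard library's `urllib.robotparser`. The sketch below pastes in the rules printed above as text so it runs without a network request. The two URLs being checked are just illustrations.

```python
from urllib.robotparser import RobotFileParser

# The rules printed above, pasted in as text (no network request needed).
robots_txt = "User-agent: *\nDisallow: /personal/"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL.
comic_allowed = parser.can_fetch("*", "https://xkcd.com/1481/")
personal_allowed = parser.can_fetch("*", "https://xkcd.com/personal/")
print(comic_allowed, personal_allowed)
```

In a live script, `set_url()` followed by `read()` would download the file instead of parsing a pasted string.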

In [None]:
urllib.request.urlretrieve("https://imgs.xkcd.com/comics/api.png", "api.png")

The cell below this is markdown. Double-click on it so it is in editing mode, then execute it to display the file you downloaded with the previous command. 

![](api.png)

Using these methods, we are treating the html as an unstructured string. If we want to retrieve the structured markup, we can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

Let's look at [this page](https://litemind.com/best-famous-quotes). What if we wanted to extract the quotes and authors? First, are we allowed to?

In [4]:
robot = urllib.request.urlopen("https://litemind.com/robots.txt")
print(robot.read())

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)>

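The request above failed with an SSL certificate verification error in this run, so the output does not show the file. On a successful run, the page we are scraping isn't excluded in the robots.txt file. Let's see what Beautiful Soup can do.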

In [None]:
from bs4 import BeautifulSoup
url = "https://litemind.com/best-famous-quotes"

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
print(soup.prettify())

In the cell above, we read our web page with urllib (we can also use the [requests](http://docs.python-requests.org/en/master/) library), then parsed it with Beautiful Soup using Python's built-in `html.parser`. You can read about the different parser options [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use).

Our parsed data is now in a variable called "soup". We used the ["prettify"](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output) method to print something a little more readable. Beautiful Soup has represented the html document as a nested data structure that we can navigate.

Beautiful Soup lets you access information through tags in the html. The tags are the same as the ones in the document. 

In [None]:
soup.title

Tags have names.

In [None]:
soup.title.name

Sometimes they have attributes too. 

In [None]:
# .attrs is a dict of the tag's attributes; it is empty when the tag has none
soup.title.attrs

But title does not. It does contain a string though.

In [None]:
soup.title.string
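
To see names, attributes, and strings side by side without fetching anything, here is a minimal sketch on a tiny document. The markup is hypothetical and exists only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, small enough to read at a glance.
doc = (
    "<html><head><title>Quotes</title></head>"
    '<body><a href="/about" id="nav">About</a></body></html>'
)
mini = BeautifulSoup(doc, "html.parser")

print(mini.title.name)    # the tag's name
print(mini.title.attrs)   # a dict of attributes, empty for this title
print(mini.title.string)  # the text inside the tag
print(mini.a.attrs)       # the link has two attributes
print(mini.a["href"])     # a single attribute, accessed like a dict key
```

Attributes behave like a dictionary, so `mini.a.get("href")` also works and returns `None` instead of raising an error when the attribute is missing.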

We can look at just the head of the page.

In [None]:
soup.head

Or the body.

In [None]:
soup.body

If we look through the body, we can see our quotes are contained here, starting after 
```<h2>Wisdom Quotes</h2>```


In [None]:
soup.h2

In [None]:
soup.h2.text

Tags have attributes that allow us to [navigate](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) through the structure of the document as well. We can move up and down a document's structure with a tag's `.parent` and `.children` attributes.

In [None]:
soup.body.parent

In [None]:
soup.head.parent

We can go "sideways" in a document to look at tags at the same level using the `.next_sibling` and `.previous_sibling` attributes. Here we can see that head and body are at the same level in our document.

In [None]:
soup.head.next_sibling

The structure of your document will determine which of these attributes are available.

As we saw above, the quotes we want to scrape start after the second heading.

In [None]:
soup.h2.next_sibling

We can chain our attributes to continue accessing things. 

In [None]:
soup.h2.next_sibling.next_sibling

In [None]:
soup.h2.next_sibling.next_sibling.next_sibling

That seems a bit cumbersome though, right?
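
One reason chaining can take several steps is that the whitespace between tags counts as a sibling too. The sketch below uses hypothetical markup (the real page's structure may differ) to show this, along with `find_next_sibling`, which skips straight to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the heading text and class name mirror the page above.
doc = """<body>
<h2>Wisdom Quotes</h2>
<div class="wp_quotepage">first quote block</div>
</body>"""
mini = BeautifulSoup(doc, "html.parser")

heading = mini.h2
print(repr(heading.next_sibling))  # the newline after </h2>, not the div

# find_next_sibling() ignores strings and returns the next matching tag.
quote_block = heading.find_next_sibling("div")
print(quote_block.get_text())
print(heading.parent.name)  # moving up the tree from the heading
```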

Beautiful Soup also allows us to [search](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) our document. A common task is to pull all of the URLs linked on a page.

In [None]:
soup.find('a')

In [None]:
soup.find_all('a')

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))
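
Links on a page are often relative, and some `a` tags have no `href` at all. A minimal sketch of cleaning that up with `urllib.parse.urljoin` follows. The markup is hypothetical, and `base_url` is assumed to be the page the links came from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://litemind.com/best-famous-quotes"

# Hypothetical markup: one relative link, one absolute link, one anchor without href.
doc = (
    '<a href="/about">About</a> '
    '<a href="https://example.com/x">External</a> '
    '<a name="top">No href</a>'
)
mini = BeautifulSoup(doc, "html.parser")

links = []
for link in mini.find_all("a"):
    href = link.get("href")  # None when the attribute is missing
    if href is None:
        continue
    # urljoin leaves absolute URLs alone and resolves relative ones against base_url.
    links.append(urljoin(base_url, href))

print(links)
```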

We found our quotes before using:
```soup.h2.next_sibling.next_sibling.next_sibling```

We can also pull them out using find.

In [None]:
soup.find('div', class_='wp_quotepage')

And we can pull them out yet another way by using [CSS Selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors).

In [None]:
soup.select('.wp_quotepage')

Once we have the elements we are looking for, we can write some code to pull them out.

In [None]:
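for quote in soup.select('.wp_quotepage'):
    # find_all() with no arguments returns every tag nested inside the quote block
    children = quote.find_all()
    # decode_contents() returns each tag's inner markup as a str rather than bytes
    text = children[0].decode_contents()
    author = children[1].decode_contents()
    print(text, author)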

It still isn't perfect, but you can clean it up from there. 
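
As one possible cleanup, the sketch below collects each quote and author into a list of dictionaries. The markup is hypothetical: only the `wp_quotepage` class comes from the page above, and the inner `p` tags and placeholder text are assumptions, so the indexing would need adjusting for the real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's inner structure may differ.
doc = """
<div class="wp_quotepage">
  <p class="quote">Quote one text.</p>
  <p class="author">Author One</p>
</div>
<div class="wp_quotepage">
  <p class="quote">Quote two text.</p>
  <p class="author">Author Two</p>
</div>
"""
mini = BeautifulSoup(doc, "html.parser")

records = []
for block in mini.select(".wp_quotepage"):
    # recursive=False limits the search to direct child tags of the block.
    children = block.find_all(recursive=False)
    records.append({
        "quote": children[0].get_text(strip=True),
        "author": children[1].get_text(strip=True),
    })

for record in records:
    print(record["quote"], "-", record["author"])
```

Storing the results as dictionaries makes it straightforward to write them to CSV or load them into a DataFrame later.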

There are a lot of resources out there for building scrapers. Do you have a page you want to scrape? If so, try it out now. We are here to answer your questions so give this a try. If you want some more ideas, here are some resources to take a look at:

**More Examples**
* [Scotch Notebook](https://github.com/nd1/pycon_2017/blob/master/scraping/scotch.ipynb) - This notebook shows the process I went through to scrape a site. It is not a polished tutorial, but instead shows some of my thought process when I am scraping.
* Tutorial for [building your first scraper](http://first-web-scraper.readthedocs.io/en/latest/)
* [Python Web Scraping Tutorial using BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [Scraping Marvel Comics](http://blog.nycdatascience.com/student-works/scraping-marvel-comics/)
* [Scraping for Craft Beers: A Dataset Creation Tutorial](http://blog.kaggle.com/2017/01/31/scraping-for-craft-beers-a-dataset-creation-tutorial/)

**Things to scrape**:
Wikipedia has a lot of good lists to practice on like [Billboard Year-End Hot 100 singles of 1960](https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960), [List of whisky distilleries in Scotland](https://en.wikipedia.org/wiki/List_of_whisky_distilleries_in_Scotland), or [List of highest-grossing Indian films](https://en.wikipedia.org/wiki/List_of_highest-grossing_Indian_films) among [other things](https://en.wikipedia.org/wiki/List_of_lists_of_lists).
