## Scraping

There is a lot of great data out on the web. Unfortunately, not all of it is readily available via APIs, and even when an API is available, it may restrict the data we have access to. Scraping usually refers to extracting web page content when APIs are not available.

In the API section, we used urllib to call an API and save data. We can also use it to aid in our extraction of data from webpages.

In [1]:
import urllib.request as urllib

In [2]:
html = urllib.urlopen("http://xkcd.com/1481/")
print(html.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<link rel="stylesheet" type="text/css" href="/s/b0dcca.css" title="Default"/>\n<title>xkcd: API</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<script type="text/javascript" src="/s/b66ed7.js" async></script>\n<script type="text/javascript" src="/s/1b9456.js" async></script>\n\n</head>\n<body>\n<div id="topContainer">\n<div id="topLeft">\n<ul>\n<li><a href="/archive">Archive</a></li>\n<li><a href="http://what-if.xkcd.com">What If?</a></li>\n<li><a href="http://blag.xkcd.com">Blag</a></li>\n<li><a href="http://store.xkcd.com/">Store</a></li>\n<li><a rel="author" href="/about">About</a></li>\n</ul>\n</div>\n<div id="topRight">\n<div id="ma

We can use the `urlretrieve` function to retrieve a specific resource, such as a file, via its URL. This is basic web scraping.

If we look through our html above, we can see there is a url for the image in the page. (Look for: ```Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/api.png```)

But before we go doing that, maybe we should check the robots.txt file first...

In [3]:
robot = urllib.urlopen("https://xkcd.com/robots.txt")
print(robot.read())

b'User-agent: *\nDisallow: /personal/'


Looks like we are good!
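Reading robots.txt by eye works for a short file, but the check can also be automated. Here is a minimal sketch using the standard library's `urllib.robotparser`. The rules are pasted in as text (copied from the output above) so the cell runs without a network call, and the helper name `allowed` is only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules printed above, pasted in as text so this sketch runs offline.
robots_txt = "User-agent: *\nDisallow: /personal/"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url, agent="*"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


comic_ok = allowed("https://xkcd.com/1481/")
personal_ok = allowed("https://xkcd.com/personal/")
print(comic_ok, personal_ok)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the file instead of parsing a pasted string.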

In [4]:
urllib.urlretrieve("http://imgs.xkcd.com/comics/api.png", "api.png")

('api.png', <http.client.HTTPMessage at 0x7f28727dbeb8>)

The cell below this is markdown. Double-click on it so it is in editing mode, then execute it to display the file you downloaded with the previous command.

![alt text](api.png)
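We found the image URL by reading the page source ourselves. As a rough sketch of what treating the page as a plain string looks like in code, a regular expression can pull the same URL out. The variable `page_text` is a one-line stand-in for the decoded page, not the full html.

```python
import re

# Stand-in for the decoded page: only the line we care about.
page_text = "Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/api.png\n"

# Capture the run of non-whitespace characters that follows the label.
match = re.search(r"Image URL \(for hotlinking/embedding\): (\S+)", page_text)
image_url = match.group(1) if match else None
print(image_url)
```

This works, but it breaks as soon as the label text changes, which is one reason to parse the markup instead.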

Using these methods, we are treating the html as an unstructured string. If we want to retrieve the structured markup, we can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

Let's look at [this page](https://litemind.com/best-famous-quotes). What if we wanted to extract the quotes and authors? First, are we allowed to?

In [5]:
robot = urllib.urlopen("https://litemind.com/robots.txt")
print(robot.read())

b'User-agent: *\nDisallow: /wp-admin\nDisallow: /wp-content/cache\nDisallow: /trackback\nDisallow: */trackback\n\nAllow: /wp-content/uploads\n\nDisallow: /manifests\nDisallow: /search\nDisallow: /newsletter-verify\nDisallow: /newsletter-welcome\nDisallow: /mind-explorations-ebook\nDisallow: /best-of-litemind-ebook\nDisallow: /wp-content/uploads/misc/best-of-litemind-ebook.pdf\n\n\n# BEGIN XML-SITEMAP-PLUGIN\nSitemap: http://litemind.com/sitemap.xml.gz\n# END XML-SITEMAP-PLUGIN\n'


The page we are scraping isn't excluded in the robots.txt file. Let's see what Beautiful Soup can do.

In [17]:
from bs4 import BeautifulSoup
url = "https://litemind.com/best-famous-quotes"

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head profile="http://gmpg.org/xfn/11">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="K2 1.0" name="template"/>
  <title>
   60 Selected Best Famous Quotes - Litemind
  </title>
  <link href="https://litemind.com/wp-content/themes/escher/style.css" media="all" rel="stylesheet" type="text/css"/>
  <link href="" rel="pingback"/>
  <script type="text/javascript">
   var ajaxurl='https://litemind.com/wp-admin/admin-ajax.php';
  </script>
  <link href="https://litemind.com/best-famous-quotes/" rel="canonical"/>
  <meta content="en_US" property="og:locale"/>
  <meta content="article" property="og:type"/>
  <meta content="60 Selected Best Famous Quotes 

In the cell above, we read our web page with urllib (we could also use the [requests](http://docs.python-requests.org/en/master/) library), then parsed it with Beautiful Soup using Python's built-in `html.parser`. You can read about the different parser options [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use).

Our parsed data is now in a variable called "soup". We used the ["prettify"](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output) method to print something a little more readable. Beautiful Soup has represented the html document as a nested data structure that we can navigate.

Beautiful Soup lets you access information through tags in the html. The tags are the same as the ones in the document. 

In [18]:
soup.title

<title>60 Selected Best Famous Quotes - Litemind</title>

Tags have names.

In [19]:
soup.title.name

'title'

Sometimes they have attributes too. 

In [20]:
soup.title.attrs

{}

But title does not, so `attrs` is an empty dictionary. It does contain a string though.

In [21]:
soup.title.string

'60 Selected Best Famous Quotes - Litemind'
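To see a tag that does carry attributes, here is a small sketch on a single link copied from this page. It uses its own variable names (`snippet`, `link`) so it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# One anchor tag copied from the page, parsed on its own.
snippet = '<a href="https://litemind.com/about/">About</a>'
link = BeautifulSoup(snippet, "html.parser").a

name = link.name      # the tag name
attrs = link.attrs    # all attributes as a dictionary
href = link["href"]   # a single attribute, accessed like a dictionary key
text = link.string    # the text inside the tag
print(name, attrs, href, text)
```

Indexing with `link["href"]` raises a `KeyError` when the attribute is missing, while `link.get("href")` returns `None` instead.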

We can look at just the head of the page.

In [22]:
soup.head

<head profile="http://gmpg.org/xfn/11"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="K2 1.0" name="template"/><title>60 Selected Best Famous Quotes - Litemind</title><link href="https://litemind.com/wp-content/themes/escher/style.css" media="all" rel="stylesheet" type="text/css"/><link href="" rel="pingback"/> <script type="text/javascript">var ajaxurl='https://litemind.com/wp-admin/admin-ajax.php';</script> <link href="https://litemind.com/best-famous-quotes/" rel="canonical"/><meta content="en_US" property="og:locale"/><meta content="article" property="og:type"/><meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/><meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intriguing to the just plain funny." property="og:description"/><meta content="https://litemind.com/best-famous-quotes/" property="og:url"/><meta content="Litemind" property="og:site_nam

Or the body.

In [23]:
soup.body

<body class="wordpress k2 y2018 m12 d05 h08 single postid-43 s-slug-best-famous-quotes s-y2008 s-m05 s-d19 s-h07 s-category-personal-development s-tag-quotes s-author-luciano-passuello columns-three lang-en wpmu-1 webkit safari chrome win"><div id="page"><div id="header"><div class="blog-title"> <a accesskey="1" href="https://litemind.com/">Litemind</a></div><p class="description">Exploring ways to use our minds efficiently.</p><ul class="menu"><li class="page_item blogtab"> <a href="https://litemind.com/" title="Home"> Home		</a></li><li class="page_item page-item-2"><a href="https://litemind.com/about/">About</a></li></ul></div><hr/><div class="content"><div id="primary-wrapper"><div id="primary"><div id="notices"></div> <a id="startcontent" name="startcontent"></a><div class="hfeed" id="current-content"><div class="navigation" id="nav-above"><div class="nav-previous"><a href="https://litemind.com/study-matrix-mind-map-showcase/" rel="prev"><span class="meta-nav">«</span> Study Matri

If we look through the body, we can see our quotes are contained here, starting after 
```<h2>Wisdom Quotes</h2>```


In [24]:
soup.h2

<h2>Wisdom Quotes</h2>

In [25]:
soup.h2.text

'Wisdom Quotes'

Tags have attributes that allow us to [navigate](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) through the structure of the document as well. We can navigate up and down a document's structure through a tag's `.parent`, `.contents`, and `.children` attributes.

In [26]:
soup.body.parent

<html lang="en-US" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/"><head profile="http://gmpg.org/xfn/11"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="K2 1.0" name="template"/><title>60 Selected Best Famous Quotes - Litemind</title><link href="https://litemind.com/wp-content/themes/escher/style.css" media="all" rel="stylesheet" type="text/css"/><link href="" rel="pingback"/> <script type="text/javascript">var ajaxurl='https://litemind.com/wp-admin/admin-ajax.php';</script> <link href="https://litemind.com/best-famous-quotes/" rel="canonical"/><meta content="en_US" property="og:locale"/><meta content="article" property="og:type"/><meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/><meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intrig

In [27]:
soup.head.parent

<html lang="en-US" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/"><head profile="http://gmpg.org/xfn/11"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="K2 1.0" name="template"/><title>60 Selected Best Famous Quotes - Litemind</title><link href="https://litemind.com/wp-content/themes/escher/style.css" media="all" rel="stylesheet" type="text/css"/><link href="" rel="pingback"/> <script type="text/javascript">var ajaxurl='https://litemind.com/wp-admin/admin-ajax.php';</script> <link href="https://litemind.com/best-famous-quotes/" rel="canonical"/><meta content="en_US" property="og:locale"/><meta content="article" property="og:type"/><meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/><meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intrig
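Going down the tree works the same way as going up. A minimal sketch, using one quote block copied from this page as a stand-alone snippet (the names `block_html` and `block` are just for illustration):

```python
from bs4 import BeautifulSoup

# One quote block from the page; \u2014 is the dash in front of the author.
block_html = (
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">1. You can do anything, but not everything.</div>'
    '<div class="wp_quotepage_author">\u2014David Allen</div>'
    '</div>'
)
block = BeautifulSoup(block_html, "html.parser").div

# .children iterates over direct children; .contents is the same thing as a list.
child_classes = [child["class"] for child in block.children]
first_child = block.contents[0]
print(child_classes)
print(first_child.parent is block)
```

Note that `class` comes back as a list, because an html element can have several classes.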

We can go "sideways" in a document to look at tags at the same level using the `next_sibling` and `previous_sibling` attributes. Here we can see that head and body are at the same level in our document.

In [28]:
soup.head.next_sibling

<body class="wordpress k2 y2018 m12 d05 h08 single postid-43 s-slug-best-famous-quotes s-y2008 s-m05 s-d19 s-h07 s-category-personal-development s-tag-quotes s-author-luciano-passuello columns-three lang-en wpmu-1 webkit safari chrome win"><div id="page"><div id="header"><div class="blog-title"> <a accesskey="1" href="https://litemind.com/">Litemind</a></div><p class="description">Exploring ways to use our minds efficiently.</p><ul class="menu"><li class="page_item blogtab"> <a href="https://litemind.com/" title="Home"> Home		</a></li><li class="page_item page-item-2"><a href="https://litemind.com/about/">About</a></li></ul></div><hr/><div class="content"><div id="primary-wrapper"><div id="primary"><div id="notices"></div> <a id="startcontent" name="startcontent"></a><div class="hfeed" id="current-content"><div class="navigation" id="nav-above"><div class="nav-previous"><a href="https://litemind.com/study-matrix-mind-map-showcase/" rel="prev"><span class="meta-nav">«</span> Study Matri

The structure of your document will determine which of these attributes are available.

As we saw above, the quotes we want to scrape start after the `<h2>` heading.

In [29]:
soup.h2.next_sibling

<div class="wp_quotepage"><div class="wp_quotepage_quote">1. You can do anything, but not everything.</div><div class="wp_quotepage_author">—David Allen</div></div>

We can chain our attributes to continue accessing things. 

In [30]:
soup.h2.next_sibling.next_sibling

<hr/>

In [31]:
soup.h2.next_sibling.next_sibling.next_sibling

<div class="wp_quotepage"><div class="wp_quotepage_quote">2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.</div><div class="wp_quotepage_author">—Antoine de Saint-Exupéry</div></div>

That seems a bit cumbersome though, right?
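If we want to stay with sibling navigation, `find_next_siblings` can skip the `<hr/>` separators for us. This is a sketch on a trimmed copy of the markup shown above, with its own variable names so it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page: the heading, two quote blocks, and the <hr/> between them.
sample_html = (
    '<h2>Wisdom Quotes</h2>'
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">1. You can do anything, but not everything.</div>'
    '<div class="wp_quotepage_author">\u2014David Allen</div></div>'
    '<hr/>'
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">2. Perfection is achieved, not when there is nothing '
    'more to add, but when there is nothing left to take away.</div>'
    '<div class="wp_quotepage_author">\u2014Antoine de Saint-Exup\u00e9ry</div></div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only later siblings of the heading that are quote blocks; the <hr/> is filtered out.
blocks = sample_soup.h2.find_next_siblings("div", class_="wp_quotepage")
quote_texts = [b.find(class_="wp_quotepage_quote").get_text() for b in blocks]
print(len(blocks))
print(quote_texts)
```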

Beautiful Soup also allows us to [search](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) our document. A common task is to pull all of the URLs linked on a page.

In [32]:
soup.find('a')

<a accesskey="1" href="https://litemind.com/">Litemind</a>

In [33]:
soup.find_all('a')

[<a accesskey="1" href="https://litemind.com/">Litemind</a>,
 <a href="https://litemind.com/" title="Home"> Home		</a>,
 <a href="https://litemind.com/about/">About</a>,
 <a id="startcontent" name="startcontent"></a>,
 <a href="https://litemind.com/study-matrix-mind-map-showcase/" rel="prev"><span class="meta-nav">«</span> Study Matrix Mind Map Showcase</a>,
 <a href="https://litemind.com/scamper/" rel="next">Creative Problem Solving with SCAMPER <span class="meta-nav">»</span></a>,
 <a href="https://litemind.com/best-famous-quotes/" rel="bookmark" title="Permanent Link to 60 Selected Best Famous Quotes">60 Selected Best Famous Quotes</a>,
 <a href="https://litemind.com/category/personal-development/" title="View all posts in Personal Development">Personal Development</a>,
 <a class="commentslink" href="https://litemind.com/best-famous-quotes/#comments">209 <span>Comments</span></a>,
 <a href="https://litemind.com/tag/quotes/" rel="tag">Quotes</a>,
 <a href="https://litemind.com/favori

In [34]:
for link in soup.find_all('a'):
    print(link.get('href'))

https://litemind.com/
https://litemind.com/
https://litemind.com/about/
None
https://litemind.com/study-matrix-mind-map-showcase/
https://litemind.com/scamper/
https://litemind.com/best-famous-quotes/
https://litemind.com/category/personal-development/
https://litemind.com/best-famous-quotes/#comments
https://litemind.com/tag/quotes/
https://litemind.com/favorite-quotes/
https://litemind.com/favorite-quotes/
https://litemind.com/five-reasons-to-collect-favorite-quotes/
http://del.icio.us/lucianop/
http://dietrich.ganx4.com/foxylicious/
http://www.quotiki.com/
https://litemind.com/favorite-quotes/
https://litemind.com/best-famous-quotes-2/
https://litemind.com/five-reasons-to-collect-favorite-quotes/
https://litemind.com/study-matrix-mind-map-showcase/
https://litemind.com/scamper/
 https://blog.iqmatrix.com/self-discipline?utm_source=litemind&utm_medium=banner&utm_campaign=self-discipline
//litemind.com/boost-brain-power/
//litemind.com/thinking-traps/
//litemind.com/tackle-any-issue-w
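The list above is a little messy: one link has no `href` (`None`), one has a leading space, and some start with `//` instead of a scheme. A possible cleanup, sketched with `urllib.parse.urljoin` on a few values copied from the output (the helper `clean_links` is a made-up name):

```python
from urllib.parse import urljoin

page_url = "https://litemind.com/best-famous-quotes"

# A few hrefs copied from the output above, including the awkward ones.
raw_hrefs = [
    "https://litemind.com/about/",
    None,
    " https://blog.iqmatrix.com/self-discipline?utm_source=litemind&utm_medium=banner&utm_campaign=self-discipline",
    "//litemind.com/boost-brain-power/",
]


def clean_links(hrefs, base):
    """Drop missing hrefs, trim whitespace, and resolve each link against the page URL."""
    cleaned = []
    for href in hrefs:
        if not href:
            continue
        cleaned.append(urljoin(base, href.strip()))
    return cleaned


links = clean_links(raw_hrefs, page_url)
for link in links:
    print(link)
```

In the notebook, the same helper could be fed `[a.get('href') for a in soup.find_all('a')]`.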

We found our quotes before using:
```soup.h2.next_sibling.next_sibling.next_sibling```

We can also pull them out using find.

In [35]:
soup.find('div', class_='wp_quotepage')

<div class="wp_quotepage"><div class="wp_quotepage_quote">1. You can do anything, but not everything.</div><div class="wp_quotepage_author">—David Allen</div></div>

And we can pull them out yet another way by using [CSS Selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors).

In [36]:
soup.select('.wp_quotepage')

[<div class="wp_quotepage"><div class="wp_quotepage_quote">1. You can do anything, but not everything.</div><div class="wp_quotepage_author">—David Allen</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.</div><div class="wp_quotepage_author">—Antoine de Saint-Exupéry</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">3. The richest man is not he who has the most, but he who needs the least.</div><div class="wp_quotepage_author">—Unknown Author</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">4. You miss 100 percent of the shots you never take.</div><div class="wp_quotepage_author">—Wayne Gretzky</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">5. Courage is not the absence of fear, but rather the judgement that something else is more important than fear.</div><div class="wp_quotepage_autho

Once we have the elements we are looking for, we can write some code to pull them out.

In [37]:
for quote in soup.select('.wp_quotepage'):
    text = quote.findChildren()[0].renderContents()
    author = quote.findChildren()[1].renderContents()
    print(text, author)

b'1. You can do anything, but not everything.' b'\xe2\x80\x94David Allen'
b'2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.' b'\xe2\x80\x94Antoine de Saint-Exup\xc3\xa9ry'
b'3. The richest man is not he who has the most, but he who needs the least.' b'\xe2\x80\x94Unknown Author'
b'4. You miss 100 percent of the shots you never take.' b'\xe2\x80\x94Wayne Gretzky'
b'5. Courage is not the absence of fear, but rather the judgement that something else is more important than fear.' b'\xe2\x80\x94Ambrose Redmoon'
b'6. You must be the change you wish to see in the world.' b'\xe2\x80\x94Gandhi'
b'7. When hungry, eat your rice; when tired, close your eyes. Fools may laugh at me, but wise men will know what I mean.' b'\xe2\x80\x94Lin-Chi'
b'8. The third-rate mind is only happy when it is thinking with the majority. The second-rate mind is only happy when it is thinking with the minority. The first-rate mind is only happy when it is th

It still isn't perfect: the values are raw bytes, and the numbering and the leading dash are still attached. You can clean it up from there.
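One possible cleanup, as a sketch rather than the only way to do it: `get_text()` returns ordinary strings instead of bytes, and a little string handling removes the numbering and the dash. The function name `extract_quotes` and the two-quote sample are assumptions for illustration; the class names are the ones from the page.

```python
import re

from bs4 import BeautifulSoup

# Two quote blocks copied from the page; \u2014 is the dash in front of each author.
sample_html = (
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">1. You can do anything, but not everything.</div>'
    '<div class="wp_quotepage_author">\u2014David Allen</div></div>'
    '<hr/>'
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">4. You miss 100 percent of the shots you never take.</div>'
    '<div class="wp_quotepage_author">\u2014Wayne Gretzky</div></div>'
)


def extract_quotes(parsed):
    """Return a list of {'quote', 'author'} dictionaries from a parsed page."""
    rows = []
    for block in parsed.select(".wp_quotepage"):
        quote = block.select_one(".wp_quotepage_quote")
        author = block.select_one(".wp_quotepage_author")
        if quote is None or author is None:
            continue  # skip blocks that do not have both parts
        text = re.sub(r"^\d+\.\s*", "", quote.get_text(strip=True))  # drop "1. "
        name = author.get_text(strip=True).lstrip("\u2014").strip()  # drop the dash
        rows.append({"quote": text, "author": name})
    return rows


quotes = extract_quotes(BeautifulSoup(sample_html, "html.parser"))
for row in quotes:
    print(row["author"], "-", row["quote"])
```

Calling `extract_quotes(soup)` on the full page should give one dictionary per quote, ready to load into a DataFrame or write to a CSV file.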

There are a lot of resources out there for building scrapers. Do you have a page you want to scrape? If so, try it out now; we are here to answer your questions. If you want some more ideas, here are some resources to take a look at:

**More Examples**
* [Scotch Notebook](https://github.com/nd1/pycon_2017/blob/master/scraping/scotch.ipynb) - This notebook shows the process I went through to scrape a site. It is not a polished tutorial, but instead shows some of my thought process when I am scraping.
* Tutorial for [building your first scraper](http://first-web-scraper.readthedocs.io/en/latest/)
* [Python Web Scraping Tutorial using BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [Scraping Marvel Comics](http://blog.nycdatascience.com/student-works/scraping-marvel-comics/)
* [Scraping for Craft Beers: A Dataset Creation Tutorial](http://blog.kaggle.com/2017/01/31/scraping-for-craft-beers-a-dataset-creation-tutorial/)

**Things to scrape**:
Wikipedia has a lot of good lists to practice on like [Billboard Year-End Hot 100 singles of 1960](https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960), [List of whisky distilleries in Scotland](https://en.wikipedia.org/wiki/List_of_whisky_distilleries_in_Scotland), or [List of highest-grossing Indian films](https://en.wikipedia.org/wiki/List_of_highest-grossing_Indian_films) among [other things](https://en.wikipedia.org/wiki/List_of_lists_of_lists).
