## Scraping

There is a lot of great data out on the web. Unfortunately, it is not all readily available via APIs, and even when an API is available, it may restrict the data we have access to. Scraping usually refers to extracting web page content when APIs are not available.

In the API section, we used urllib to call an API and save data. We can also use it to aid in our extraction of data from webpages.

In [2]:
import urllib.request  # importing the submodule explicitly makes urllib.request available
print('Loading Libraries')

Loading Libraries


In [6]:
html = urllib.request.urlopen("http://xkcd.com/1481/")
# read() returns the raw page as bytes, which is why the output starts with b'...'
print(html.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<link rel="stylesheet" type="text/css" href="/s/7d94e0.css" title="Default"/>\n<title>xkcd: API</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<script type="text/javascript" src="/s/b66ed7.js" async></script>\n<script type="text/javascript" src="/s/1b9456.js" async></script>\n\n<meta property="og:site_name" content="xkcd">\n\n<meta property="og:title" content="API">\n<meta property="og:url" content="https://xkcd.com/1481/">\n<meta property="og:image" content="https://imgs.xkcd.com/comics/api_2x.png">\n<meta name="twitter:card" content="summary_large_image">\n\n</head>\n<body>\n<div id="topContainer">\n<div id="topLeft">\n<ul>\n<li><a hre

We can use the urlretrieve function to retrieve a specific resource, such as a file, by URL. This is basic web scraping.

If we look through our html above, we can see there is a url for the image in the page. (Look for: ```Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/api.png```)
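Because the response is just text at this point, one way to locate that URL is ordinary string searching. The sketch below is only an illustration: `page_text` is a hypothetical stand-in for the decoded page (what `html.read().decode("utf-8")` would give), and the pattern assumes the label text quoted above appears unchanged.

```python
import re

# Hypothetical stand-in for the decoded page source; in the notebook this
# would come from html.read().decode("utf-8").
page_text = (
    "\n"
    "Image URL (for hotlinking/embedding): "
    "https://imgs.xkcd.com/comics/api.png\n"
)

# Capture the run of non-whitespace characters that follows the label.
match = re.search(r"Image URL \(for hotlinking/embedding\):\s*(\S+)", page_text)
image_url = match.group(1) if match else None
print(image_url)
```

Searching like this is brittle: if the site changes the label wording, the pattern stops matching and `image_url` is `None`.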

But before we go doing that, maybe we should check the robots.txt file first...

In [7]:
robot = urllib.request.urlopen("https://xkcd.com/robots.txt")
print(robot.read())

b'User-agent: *\nDisallow: /personal/'


Looks like we are good!
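Reading the rules by eye works for a short file. For longer ones, the standard library's `urllib.robotparser` can answer the question for us. The sketch below feeds it the rules printed above as a string (so it runs without a network call) and asks about two paths; the `/personal/` URL is only an illustrative path matching the Disallow rule.

```python
from urllib.robotparser import RobotFileParser

# The rules printed above, as text.
robots_txt = "User-agent: *\nDisallow: /personal/"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the request.
comic_allowed = parser.can_fetch("*", "https://xkcd.com/1481/")
personal_allowed = parser.can_fetch("*", "https://xkcd.com/personal/")
print(comic_allowed, personal_allowed)
```

In a live session, `parser.set_url(...)` followed by `parser.read()` would fetch the file directly instead of parsing a string.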

In [8]:
urllib.request.urlretrieve("http://imgs.xkcd.com/comics/api.png", "api.png")

('api.png', <http.client.HTTPMessage at 0x2cb318dff08>)

The cell below this is markdown. Double-click on it so it is in editing mode, then execute it to display the file you downloaded with the previous command.

![](api.png)

Using these methods, we are treating the html as an unstructured string. If we want to retrieve the structured markup, we can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

Let's look at [this page](https://litemind.com/best-famous-quotes). What if we wanted to extract the quotes and authors? First, are we allowed to?

In [None]:
robot = urllib.request.urlopen("https://litemind.com/robots.txt")
print(robot.read())

The page we are scraping isn't excluded in the robots.txt file. Let's see what Beautiful Soup can do.

In [10]:
from bs4 import BeautifulSoup
url = "https://litemind.com/best-famous-quotes"

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <link href="/superpwa-manifest.json" rel="manifest"/>
  <meta content="#D5E0EB" name="theme-color"/>
  <title>
   60 Selected Best Famous Quotes - Litemind
  </title>
  <meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/>
  <link href="https://litemind.com/best-famous-quotes/" rel="canonical"/>
  <meta content="en_US" property="og:locale"/>
  <meta content="article" property="og:type"/>
  <meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/>
  <meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intriguing to the just plain funny." property="og:description"/>
  <meta content="https://litemind.com/best-famous-quotes/" property="og:url"/>
  <meta content

In the cell above, we read our web page with urllib (we can also use the [requests](http://docs.python-requests.org/en/master/) library), then parsed it with the Beautiful Soup html parser. You can read about the different parser options [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use).

Our parsed data is now in a variable called "soup". We used the ["prettify"](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output) method to print something a little more readable. Beautiful Soup has represented the html document as a nested data structure that we can navigate.

Beautiful Soup lets you access information through tags in the html. The tags are the same as the ones in the document. 

In [11]:
soup.title

<title>60 Selected Best Famous Quotes - Litemind</title>

Tags have names.

In [12]:
soup.title.name

'title'

Sometimes they have attributes too. 

In [14]:
soup.title.attrs

But title does not. It does contain a string though.

In [15]:
soup.title.string

'60 Selected Best Famous Quotes - Litemind'
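To see names, attributes, and strings side by side, here is a small sketch on a made-up document. The HTML in `sample` is hypothetical and is used only so the example runs without fetching a page.

```python
from bs4 import BeautifulSoup

# A small hypothetical document, so the example runs without a network call.
sample = (
    "<html><head><title>Demo</title></head>"
    '<body><a class="ext" href="https://example.com">link</a></body></html>'
)
doc = BeautifulSoup(sample, "html.parser")

print(doc.title.name)    # the tag's name
print(doc.title.attrs)   # dict of attributes; empty for this title
print(doc.title.string)  # the text inside the tag
print(doc.a.attrs)       # the anchor tag does carry attributes
print(doc.a["href"])     # a single attribute, accessed like a dict key
```

Note that `class` comes back as a list, because an HTML element can carry several classes at once.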

We can look at just the head of the page.

In [16]:
soup.head

<head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="https://gmpg.org/xfn/11" rel="profile"/><link href="/superpwa-manifest.json" rel="manifest"/><meta content="#D5E0EB" name="theme-color"/><title>60 Selected Best Famous Quotes - Litemind</title><meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/><link href="https://litemind.com/best-famous-quotes/" rel="canonical"/><meta content="en_US" property="og:locale"/><meta content="article" property="og:type"/><meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/><meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intriguing to the just plain funny." property="og:description"/><meta content="https://litemind.com/best-famous-quotes/" property="og:url"/><meta content="Litemind" property="og:site_name"/><meta content="Quotes" property="article:tag"/><m

Or the body.

In [17]:
soup.body

<body class="post-template-default single single-post postid-43 single-format-standard ast-desktop ast-plain-container ast-right-sidebar astra-2.4.3 ast-header-custom-item-inside ast-blog-single-style-1 ast-single-post ast-mobile-inherit-site-logo ast-inherit-site-logo-transparent ast-normal-title-enabled" itemscope="itemscope" itemtype="https://schema.org/Blog"><div class="hfeed site" id="page"> <a class="skip-link screen-reader-text" href="#content">Skip to content</a><header class="site-header ast-primary-submenu-animation-fade header-main-layout-1 ast-primary-menu-enabled ast-logo-title-inline ast-hide-custom-menu-mobile ast-menu-toggle-icon ast-mobile-header-inline" id="masthead" itemscope="itemscope" itemtype="https://schema.org/WPHeader"><div class="main-header-bar-wrap"><div class="main-header-bar"><div class="ast-container"><div class="ast-flex main-header-container"><div class="site-branding"><div class="ast-site-identity" itemscope="itemscope" itemtype="https://schema.org/Or

If we look through the body, we can see our quotes are contained here, starting after 
```<h2>Wisdom Quotes</h2>```


In [18]:
soup.h2

<h2>Wisdom Quotes</h2>

In [19]:
soup.h2.text

'Wisdom Quotes'

Tags have attributes that allow us to [navigate](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) through the structure of the document as well. We can move up and down a document's structure using a tag's parent and children attributes.

In [20]:
soup.body.parent

<html lang="en-US"><head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="https://gmpg.org/xfn/11" rel="profile"/><link href="/superpwa-manifest.json" rel="manifest"/><meta content="#D5E0EB" name="theme-color"/><title>60 Selected Best Famous Quotes - Litemind</title><meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/><link href="https://litemind.com/best-famous-quotes/" rel="canonical"/><meta content="en_US" property="og:locale"/><meta content="article" property="og:type"/><meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/><meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intriguing to the just plain funny." property="og:description"/><meta content="https://litemind.com/best-famous-quotes/" property="og:url"/><meta content="Litemind" property="og:site_name"/><meta content="Quotes" propert

In [21]:
soup.head.parent

<html lang="en-US"><head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="https://gmpg.org/xfn/11" rel="profile"/><link href="/superpwa-manifest.json" rel="manifest"/><meta content="#D5E0EB" name="theme-color"/><title>60 Selected Best Famous Quotes - Litemind</title><meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/><link href="https://litemind.com/best-famous-quotes/" rel="canonical"/><meta content="en_US" property="og:locale"/><meta content="article" property="og:type"/><meta content="60 Selected Best Famous Quotes - Litemind" property="og:title"/><meta content="These are the very best 60 quotes, from nearly a decade of collecting them. They range from the profound to the intriguing to the just plain funny." property="og:description"/><meta content="https://litemind.com/best-famous-quotes/" property="og:url"/><meta content="Litemind" property="og:site_name"/><meta content="Quotes" propert

We can go "sideways" in a document to look at tags at the same level using the next_sibling and previous_sibling attributes. Here we can see that head and body are at the same level in our document.

In [22]:
soup.head.next_sibling

<body class="post-template-default single single-post postid-43 single-format-standard ast-desktop ast-plain-container ast-right-sidebar astra-2.4.3 ast-header-custom-item-inside ast-blog-single-style-1 ast-single-post ast-mobile-inherit-site-logo ast-inherit-site-logo-transparent ast-normal-title-enabled" itemscope="itemscope" itemtype="https://schema.org/Blog"><div class="hfeed site" id="page"> <a class="skip-link screen-reader-text" href="#content">Skip to content</a><header class="site-header ast-primary-submenu-animation-fade header-main-layout-1 ast-primary-menu-enabled ast-logo-title-inline ast-hide-custom-menu-mobile ast-menu-toggle-icon ast-mobile-header-inline" id="masthead" itemscope="itemscope" itemtype="https://schema.org/WPHeader"><div class="main-header-bar-wrap"><div class="main-header-bar"><div class="ast-container"><div class="ast-flex main-header-container"><div class="site-branding"><div class="ast-site-identity" itemscope="itemscope" itemtype="https://schema.org/Or

The structure of your document will determine which of these attributes are available.
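One structural detail that often surprises people is whitespace: text between two tags is a node of its own. The sketch below uses a hypothetical two-line document to show the difference between `next_sibling` and `find_next_sibling()`.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a newline between </head> and <body>.
sample = "<html><head><title>Demo</title></head>\n<body><p>Hello</p></body></html>"
doc = BeautifulSoup(sample, "html.parser")

# The newline between the two tags is itself a node in the tree,
# so next_sibling returns that string rather than the body tag.
print(repr(doc.head.next_sibling))

# find_next_sibling() skips over text nodes and returns the next tag.
print(doc.head.find_next_sibling().name)

# parent moves up one level; children iterates one level down.
print(doc.p.parent.name)
print([child.name for child in doc.body.children])
```

If a chain of `next_sibling` calls returns something unexpected on a real page, a stray text node like this is a likely cause.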

As we saw above, the quotes we want to scrape start after the Wisdom Quotes heading, which is the first h2 tag in the document.

In [23]:
soup.h2.next_sibling

<div class="wp_quotepage"><div class="wp_quotepage_quote">1. You can do anything, but not everything.</div><div class="wp_quotepage_author">—David Allen</div></div>

We can chain our attributes to continue accessing things. 

In [None]:
soup.h2.next_sibling.next_sibling

In [None]:
soup.h2.next_sibling.next_sibling.next_sibling

That seems a bit cumbersome though, right?

Beautiful Soup also allows us to [search](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) our document. A common task is to pull all of the URLs linked on a page.

In [24]:
soup.find('a')

<a class="skip-link screen-reader-text" href="#content">Skip to content</a>

In [25]:
soup.find_all('a')

[<a class="skip-link screen-reader-text" href="#content">Skip to content</a>,
 <a href="https://litemind.com/" itemprop="url" rel="home"> Litemind </a>,
 <a href="https://litemind.com/category/decision-making/">Decision Making</a>,
 <a href="https://litemind.com/category/brainpower/">Brainpower</a>,
 <a href="https://litemind.com/category/creativity/">Creativity</a>,
 <a href="https://litemind.com/category/productivity/">Productivity</a>,
 <a href="https://litemind.com/category/personal-development/">Personal Development</a>,
 <a aria-label="Search icon link" class="slide-search astra-search-icon" href="#"> <span class="screen-reader-text">Search</span> </a>,
 <a href="https://litemind.com/category/personal-development/" rel="category tag">Personal Development</a>,
 <a href="https://litemind.com/favorite-quotes/" title="Full Favorite Quotes Collection">browse the entire collection</a>,
 <a href="https://litemind.com/favorite-quotes/" title="Full Favorite Quotes Collection">my favorite 

In [26]:
for link in soup.find_all('a'):
    print(link.get('href'))

#content
https://litemind.com/
https://litemind.com/category/decision-making/
https://litemind.com/category/brainpower/
https://litemind.com/category/creativity/
https://litemind.com/category/productivity/
https://litemind.com/category/personal-development/
#
https://litemind.com/category/personal-development/
https://litemind.com/favorite-quotes/
https://litemind.com/favorite-quotes/
https://litemind.com/five-reasons-to-collect-favorite-quotes/
http://del.icio.us/lucianop/
http://dietrich.ganx4.com/foxylicious/
http://www.quotiki.com/
https://litemind.com/favorite-quotes/
https://litemind.com/best-famous-quotes-2/
https://litemind.com/five-reasons-to-collect-favorite-quotes/
https://litemind.com/study-matrix-mind-map-showcase/
https://litemind.com/scamper/
http://bit.ly/visual-tools_IQmatrix
//litemind.com/boost-brain-power/
//litemind.com/thinking-traps/
//litemind.com/tackle-any-issue-with-a-list-of-100/
//litemind.com/best-famous-quotes/
//litemind.com/problem-definition/
//litemin
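The list mixes absolute URLs, fragment-only links such as `#content`, and protocol-relative links that begin with `//`. If we want to follow them, one option is to resolve each against the page URL with `urllib.parse.urljoin`. This is a minimal sketch using three href values of the kinds printed above.

```python
from urllib.parse import urljoin

base_url = "https://litemind.com/best-famous-quotes"

# A few href values of the kinds printed above.
hrefs = ["#content", "https://litemind.com/", "//litemind.com/boost-brain-power/"]

# urljoin fills in whatever the href is missing (scheme, host, or path).
absolute = [urljoin(base_url, href) for href in hrefs]
for link in absolute:
    print(link)
```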

We found our quotes before using:
```soup.h2.next_sibling.next_sibling.next_sibling```

We can also pull them out using find.

In [27]:
soup.find('div', class_='wp_quotepage')

<div class="wp_quotepage"><div class="wp_quotepage_quote">1. You can do anything, but not everything.</div><div class="wp_quotepage_author">—David Allen</div></div>

And we can pull them out yet another way by using [CSS Selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors).

In [28]:
soup.select('.wp_quotepage')

[<div class="wp_quotepage"><div class="wp_quotepage_quote">1. You can do anything, but not everything.</div><div class="wp_quotepage_author">—David Allen</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">2. Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.</div><div class="wp_quotepage_author">—Antoine de Saint-Exupéry</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">3. The richest man is not he who has the most, but he who needs the least.</div><div class="wp_quotepage_author">—Unknown Author</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">4. You miss 100 percent of the shots you never take.</div><div class="wp_quotepage_author">—Wayne Gretzky</div></div>,
 <div class="wp_quotepage"><div class="wp_quotepage_quote">5. Courage is not the absence of fear, but rather the judgement that something else is more important than fear.</div><div class="wp_quotepage_autho
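To compare the two approaches without fetching the page, the sketch below rebuilds a tiny document from two of the quote blocks shown above. The surrounding `sample` string is an assumption for illustration; the real page contains much more markup.

```python
from bs4 import BeautifulSoup

# Two quote blocks copied from the output above (\u2014 is the leading dash).
sample = (
    "<h2>Wisdom Quotes</h2>"
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">1. You can do anything, but not everything.</div>'
    '<div class="wp_quotepage_author">\u2014David Allen</div></div>'
    '<div class="wp_quotepage">'
    '<div class="wp_quotepage_quote">4. You miss 100 percent of the shots you never take.</div>'
    '<div class="wp_quotepage_author">\u2014Wayne Gretzky</div></div>'
)
doc = BeautifulSoup(sample, "html.parser")

# find_all with class_ and a CSS class selector pick out the same blocks.
by_find = doc.find_all("div", class_="wp_quotepage")
by_select = doc.select("div.wp_quotepage")
print(len(by_find), len(by_select))

# A descendant selector reaches straight into each block.
authors = [tag.get_text() for tag in doc.select(".wp_quotepage .wp_quotepage_author")]
print(authors)
```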

Once we have the elements we are looking for, we can write some code to pull them out.

In [None]:
for quote in soup.select('.wp_quotepage'):
    # Each quote block holds two child divs: the quote text and the author.
    text = quote.find(class_='wp_quotepage_quote').get_text()
    author = quote.find(class_='wp_quotepage_author').get_text()
    print(text, author)

It still isn't perfect, but you can clean it up from there. 
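As one possible cleanup step, the sketch below strips the leading counter from each quote and the leading dash from each author. The `clean_quote` helper and the `raw_pairs` list are hypothetical names introduced here; the strings themselves are taken from the output shown earlier.

```python
import re

# Raw strings as they appear in the page, taken from the output shown earlier.
raw_pairs = [
    ("1. You can do anything, but not everything.", "\u2014David Allen"),
    ("4. You miss 100 percent of the shots you never take.", "\u2014Wayne Gretzky"),
]

def clean_quote(text, author):
    """Drop the leading 'N. ' counter and the dash in front of the author."""
    text = re.sub(r"^\d+\.\s*", "", text.strip())
    author = author.strip().lstrip("\u2014").strip()
    return {"quote": text, "author": author}

quotes = [clean_quote(text, author) for text, author in raw_pairs]
for row in quotes:
    print(row)
```

A list of dictionaries like this is easy to hand to `csv.DictWriter` or a DataFrame later on.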

There are a lot of resources out there for building scrapers. Do you have a page you want to scrape? If so, try it out now; we are here to answer your questions. If you want some more ideas, here are some resources to take a look at:

**More Examples**
* [Scotch Notebook](https://github.com/nd1/pycon_2017/blob/master/scraping/scotch.ipynb) - This notebook shows the process I went through to scrape a site. It is not a polished tutorial, but instead shows some of my thought process when I am scraping.
* Tutorial for [building your first scraper](http://first-web-scraper.readthedocs.io/en/latest/)
* [Python Web Scraping Tutorial using BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [Scraping Marvel Comics](http://blog.nycdatascience.com/student-works/scraping-marvel-comics/)
* [Scraping for Craft Beers: A Dataset Creation Tutorial](http://blog.kaggle.com/2017/01/31/scraping-for-craft-beers-a-dataset-creation-tutorial/)

**Things to scrape**:
Wikipedia has a lot of good lists to practice on like [Billboard Year-End Hot 100 singles of 1960](https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960), [List of whisky distilleries in Scotland](https://en.wikipedia.org/wiki/List_of_whisky_distilleries_in_Scotland), or [List of highest-grossing Indian films](https://en.wikipedia.org/wiki/List_of_highest-grossing_Indian_films) among [other things](https://en.wikipedia.org/wiki/List_of_lists_of_lists).
