Introduction to Web Scraping
-----

The web keeps getting bigger...

![](http://i.marketingprofs.com/assets/images/daily-data-point/data-meeker-030614.jpg)

There is a lot of data on the web.   

Let's start figuring out how to get some!

By The End Of This Session You Should Be Able To:
----

- Explain how the Internet works so we can bend it to our will
- Perform basic web scraping

Web vs Internet vs HTTP
-----

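- The Web can be thought of as a set of islands: addresses whose content is served by servers.
- The Internet is like the land bridges connecting the islands: the network that carries traffic between them.
- HTTP is the protocol that browsers and servers use to request and deliver Web content.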
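
One way to see how these pieces fit together is to take a web address apart. The sketch below uses Python's standard `urllib.parse` module on the Project Gutenberg URL that is downloaded later in this session; it only splits the string and does not contact the server.

```python
from urllib.parse import urlsplit

# Split a web address into its parts (no network access involved)
parts = urlsplit("http://www.gutenberg.org/files/4300/4300-0.txt")

print(parts.scheme)    # the protocol used to talk to the server
print(parts.hostname)  # the server (the "island") that holds the content
print(parts.path)      # which resource on that server we want
```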

HTTP 
-----

HTTP (Hypertext Transfer Protocol) is a set of rules for requesting and receiving web content.

Just as a standard application has the CRUD actions, most of what you do on the Web maps onto one of these four HTTP methods:

- GET: like "Read"
- POST: like "Create"
- PUT: like "Update" or edit
- DELETE: like "Delete"

If you have used a REST API, this should be familiar.
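
As a rough illustration of the CRUD mapping above, the sketch below builds one request object per action with the standard library's `urllib.request.Request`. The URL is a hypothetical endpoint chosen for illustration, and nothing is sent over the network because the requests are only constructed, never opened.

```python
from urllib.request import Request

# Hypothetical REST endpoint, used only for illustration
url = "http://example.com/snacks/42"

# Map each CRUD action to the HTTP method that usually carries it
crud_to_http = {
    "Create": "POST",
    "Read": "GET",
    "Update": "PUT",
    "Delete": "DELETE",
}

# Building a Request object does not send anything over the network
prepared = {
    action: Request(url, method=method)
    for action, method in crud_to_http.items()
}

for action, req in prepared.items():
    print(action, "->", req.get_method(), req.full_url)
```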

### Warm up - Getting .txt Data

Let's start with downloading raw data using wget. 

In [1]:
# If you don't have wget installed
! brew install wget

==> Auto-updated Homebrew!
Updated 2 taps (homebrew/core, homebrew/dupes).
==> New Formulae
loc          neatvi       osmfilter    pacparser    urbit        willgit
==> Updated Formulae
algernon                                 libdivecomputer
apache-geode                             libmagic
archi-steam-farm                         libphonenumber
argon2                                   logstash
aws-apigateway-importer                  lynis
aws-sdk-cpp                              m-cli
awscli ✔                                 macvim
bashdb                                   makepkg
bib-tool                                 mighttpd2
cabal-install                            mitmproxy
ccache                                   mpv
chaiscript                               nats-streaming-server
checkstyle                               ncmpcpp
chromedriver                             ncrack
cli53                              

In [2]:
# How about some light reading with Ulysses by James Joyce?
# Let's download it
! wget http://www.gutenberg.org/files/4300/4300-0.txt

--2016-10-31 09:59:11--  http://www.gutenberg.org/files/4300/4300-0.txt
Resolving www.gutenberg.org... 152.19.134.47, 2610:28:3090:3000::bad:cafe:47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1580914 (1.5M) [text/plain]
Saving to: ‘4300-0.txt’


2016-10-31 09:59:14 (611 KB/s) - ‘4300-0.txt’ saved [1580914/1580914]
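
If wget is not available, the same download can be sketched in pure Python. The `download` helper below is a hypothetical name introduced for illustration, not part of any library; it relies only on the standard `urllib.request` and `shutil` modules.

```python
import shutil
from urllib.request import urlopen


def download(url, filename):
    """Fetch `url` and save the response body to `filename`."""
    with urlopen(url) as response, open(filename, "wb") as out:
        # Stream the body to disk instead of holding it all in memory
        shutil.copyfileobj(response, out)
    return filename


# Usage (needs network access):
# download("http://www.gutenberg.org/files/4300/4300-0.txt", "4300-0.txt")
```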



In [3]:
# Load the text into memory (the file is UTF-8 encoded)
with open('4300-0.txt', 'r', encoding='utf-8') as f:
    ulysses = f.read()

In [6]:
# Remember: Always visually inspect your data
print(ulysses[585:1000])



by James Joyce






— I —





[ 1 ]

Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
lather on which a mirror and a razor lay crossed. A yellow dressinggown,
ungirdled, was sustained gently behind him on the mild morning air. He
held the bowl aloft and intoned:

—Introibo ad altare Dei.

Halted, he peered down the dark winding stairs and called out coarsely:

—Come up, Kinch! Come up,


In [4]:
# Ready for data science
# How often does the main character's name appear?
ulysses.count('Leopold Bloom')

15
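
An exact-phrase count like the one above misses places where the character is referred to by surname only. The sketch below compares three ways of counting on a short made-up sample (not a quote from the book); the same calls could be applied to the full `ulysses` string.

```python
import re

# Made-up sample text for illustration, not a quote from the book
sample = "Mr Bloom walked on. Leopold Bloom smiled. Bloom's hat sat ready for Bloomsday."

# Exact phrase match, as in the cell above
full_name = sample.count("Leopold Bloom")

# Plain substring match also counts "Bloomsday"
substring = sample.count("Bloom")

# Whole-word match: the \b anchors skip "Bloomsday" but keep "Bloom's"
surname = len(re.findall(r"\bBloom\b", sample))

print(full_name, substring, surname)
```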

### Scraping A Webpage

In [8]:
# The PyPI package is named beautifulsoup4; the older BeautifulSoup package (version 3) supports only Python 2
! pip install beautifulsoup4

In [7]:
import requests
from bs4 import BeautifulSoup

requests is for getting data (a Python counterpart to wget).

BeautifulSoup is for parsing HTML.

[List of state universities in the United States](https://en.wikipedia.org/wiki/List_of_state_universities_in_the_United_States)

In [5]:
url_univ = "https://en.wikipedia.org/wiki/List_of_state_universities_in_the_United_States"

In [6]:
# Get the webpage
r_univ = requests.get(url_univ)
r_univ.raise_for_status()  # Stop early if the server returned an error status

In [31]:
# Get the "soup"
soup_univ = BeautifulSoup(r_univ.content, 'html.parser')

In [32]:
# Have a look at the soup
print(soup_univ.body.prettify())

<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-List_of_state_universities_in_the_United_States rootpage-List_of_state_universities_in_the_United_States skin-vector action-view feature-footer-v2">
 <div class="noprint" id="mw-page-base">
 </div>
 <div class="noprint" id="mw-head-base">
 </div>
 <div class="mw-body" id="content" role="main">
  <a id="top">
  </a>
  <div id="siteNotice">
   <!-- CentralNotice -->
  </div>
  <div class="mw-indicators">
  </div>
  <h1 class="firstHeading" id="firstHeading" lang="en">
   List of state universities in the United States
  </h1>
  <div class="mw-body-content" id="bodyContent">
   <div id="siteSub">
    From Wikipedia, the free encyclopedia
   </div>
   <div id="contentSub">
   </div>
   <div class="mw-jump" id="jump-to-nav">
    Jump to:
    <a href="#mw-head">
     navigation
    </a>
    ,
    <a href="#p-search">
     search
    </a>
   </div>
   <div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="

CSS Selectors
-----

CSS (Cascading Style Sheets) is used for formatting web pages. A CSS class lets the web developer apply the same set of formatting rules to many sections of a page, so the same class is often reused (e.g. every review block is formatted the same way). As scrapers, we can use CSS selectors, which match elements by tag, class, or id, to grab the elements that we care about.

A convenient way to discover which CSS selectors apply to a page is the Inspect Element tool in the Google Chrome browser.
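
To make the selector idea concrete, here is a minimal sketch that runs a few common selector forms against a small inline HTML fragment. The fragment, its class names, and its id are made up for illustration; only `BeautifulSoup(...)`, `select`, and `get_text` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the class and id names are made up
html = """
<div id="reviews">
  <div class="review"><span class="author">Ann</span><p>Great snack.</p></div>
  <div class="review"><span class="author">Bo</span><p>Too salty.</p></div>
</div>
<p>Footer text</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag selector: every <p> in the fragment
all_paragraphs = soup.select("p")

# Class selector: every element with class="review"
reviews = soup.select(".review")

# Id plus descendant selector: <p> tags anywhere inside id="reviews"
review_texts = [p.get_text() for p in soup.select("#reviews p")]

# Child combinator: author spans directly under a review block
authors = [s.get_text() for s in soup.select("div.review > span.author")]

print(len(all_paragraphs), len(reviews))
print(review_texts)
print(authors)
```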

In [41]:
soup_univ.select("title")

[<title>List of state universities in the United States - Wikipedia</title>]

In [43]:
soup_univ.select("body a")

[<a id="top"></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/></a>,
 <a href="/wiki/Wikipedia:Verifiability" title="Wikipedia:Verifiability">verification</a>,
 <a class="external text" href="//en.wikipedia.org/w/index.php?title=List_of_state_universities_in_the_United_States&amp;action=edit">improve this article</a>,
 <a href="/wiki/Help:Introduction_to_referencing_with_Wiki_Markup/1" title="Help:Introduction to referencing with Wiki Markup/1">adding citations to reliable sources</a>,
 <a href="/wiki/Help

### Another Example

In [54]:
topic = 'Data_science'
url = 'https://en.wikipedia.org/wiki/{0}'.format(topic)

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

print('First paragraph:', soup.find('p').text)

First paragraph: Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured,[1][2] which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics,[3] similar to Knowledge Discovery in Databases (KDD).


In [55]:
# find() returns only the first <p> tag; find_all() returns every one
print('Number of paragraphs: {0}'.format(len(soup.find_all('p'))))


### Summary - The Beautiful Soup Workflow
1. Find a website with structured, static webpages.
2. Request the page and look at its structure.
3. Find the CSS selectors that match the elements containing the data.
4. Validate and clean the data.
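
The last step of the workflow can be sketched with the standard library alone. Scraped Wikipedia text often carries citation markers such as [1] and irregular whitespace; the hypothetical `clean_text` helper below (a name introduced here for illustration) removes both, and the sample string is adapted from the paragraph printed earlier.

```python
import re


def clean_text(raw):
    """Strip citation markers such as [1] and collapse runs of whitespace."""
    without_citations = re.sub(r"\[\d+\]", "", raw)
    return re.sub(r"\s+", " ", without_citations).strip()


raw = "structured or unstructured,[1][2]   which is a continuation\n of some fields"
cleaned = clean_text(raw)

# Validate before trusting the result
assert cleaned, "scraped text should not be empty"
assert "[" not in cleaned

print(cleaned)
```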
 

----
Challenge Exercises
-----

Let's scrape snack data from http://snackdata.com/

TODO: How many links are on the page?

In [56]:
assert len(href_tags) == 304 # This might be wrong if the page updated

NameError: name 'href_tags' is not defined

TODO: Find all the "cuisine" values

In [57]:
assert cuisine == ['American',
 'Chinese',
 'English',
 'French',
 'German',
 'Italian',
 'Japanese',
 'Korean',
 'Mexican',
 'Thai']

NameError: name 'cuisine' is not defined


----