### What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

### What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
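
Both practices above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the robots.txt content below is made up (it is not IMDb's real file), and the `my-scraper` user agent, the `example.com` URLs, and the helper names `allowed_urls` and `polite_visit` are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (NOT IMDb's real file), inlined so the
# example runs without a network connection.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(urls, robots_txt, user_agent="my-scraper"):
    """Return only the URLs that the given robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]

def polite_visit(urls, delay_seconds=1.0, visit=print):
    """Call visit(url) for each URL, pausing between calls to limit the request rate."""
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        visit(url)

candidates = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
]
ok = allowed_urls(candidates, SAMPLE_ROBOTS_TXT)
polite_visit(ok, delay_seconds=0.1)
```

In a real scraper, `visit` would be the function that downloads and parses a page, and the delay would be chosen to suit the site being scraped.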

### What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
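
To make "tags" and "attributes" concrete, here is a small sketch using Python's built-in `html.parser` module (not the Beautiful Soup library used later in this notebook). The `TagLister` class and the sample HTML snippet are invented for illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record every opening tag with its attributes, and every closing tag."""

    def __init__(self):
        super().__init__()
        self.opened = []   # list of (tag name, attribute dict)
        self.closed = []   # list of tag names

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.opened.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.closed.append(tag)

sample = '<p class="topic" id="api">APIs are useful.</p><h2>Resource List</h2>'
lister = TagLister()
lister.feed(sample)
lister.close()

for tag, attrs in lister.opened:
    print(tag, attrs)
```

The `p` tag carries two attributes (`class` and `id`), while the `h2` tag has none, which is why attributes are described as optional.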

### How do you view HTML code?
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users must first enable the Develop menu: Safari menu > Preferences > Advanced > "Show Develop menu in menu bar"
- Let's inspect example.html

In [1]:
# request the HTML code for a web page (the HTML is then available as the string r.text)
import requests
url = r'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/example.html'
r = requests.get(url)

In [2]:
r

<Response [200]>

In [3]:
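# convert HTML into a structured Soup object
# (name the parser explicitly; omitting it triggers a warning and the result can vary by machine)
from bs4 import BeautifulSoup
b = BeautifulSoup(r.text, 'lxml')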

In [4]:
# print out the object, first as-is and then with the indentation added by prettify
print(b)
print(b.prettify())

<!DOCTYPE html>
<html lang="en">
<head>
<title>Example Web Page</title>
</head>
<body>
<h1 id="main">DAT7 Class 7</h1>
<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>
<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>
<p class="topic" id="reproducibility">Finally, we are covering reproducibility.</p>
<h2>Resource List</h2>
<p>Here are some helpful API resources:</p>
<ul id="api">
<li>API resource 1</li>
<li>API resource 2</li>
</ul>
<p>Here are some helpful web scraping resources:</p>
<ul id="scraping">
<li>Web scraping resource 1</li>
<li>Web scraping resource 2</li>
</ul>
</body>
</html>

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Example Web Page
  </title>
 </head>
 <body>
  <h1 id="main">
   DAT7 Class 7
  </h1>
  <p class="topic" id="api">
   First, we are covering APIs, which are useful for getting data.
  </p>
  <p class="topic" id="scraping">
   Then, we are c

In [5]:
# 'find' method returns the first matching Tag (and everything inside of it)
b.find(name='body')
b.find(name='h1')

<h1 id="main">DAT7 Class 7</h1>

In [6]:
# Tags allow you to access the 'inside text'
b.find(name='h1').text

'DAT7 Class 7'

In [7]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']

'main'

In [8]:
# 'find_all' method is useful for finding all matching Tags
b.find(name='p')        # returns a Tag
b.find_all(name='p')    # returns a ResultSet (like a list of Tags)

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="reproducibility">Finally, we are covering reproducibility.</p>,
 <p>Here are some helpful API resources:</p>,
 <p>Here are some helpful web scraping resources:</p>]

In [9]:
# ResultSets can be indexed (and sliced) like lists
len(b.find_all(name='p'))
b.find_all(name='p')[0]
b.find_all(name='p')[0].text
b.find_all(name='p')[0]['id']

'api'

In [10]:
# iterate over a ResultSet
results = b.find_all(name='p')
for tag in results:
    print(tag.text)

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, we are covering reproducibility.
Here are some helpful API resources:
Here are some helpful web scraping resources:


In [11]:
# limit search by Tag attribute
b.find(name='p', attrs={'id':'scraping'})
b.find_all(name='p', attrs={'class':'topic'})
b.find_all(attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="reproducibility">Finally, we are covering reproducibility.</p>]
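
One caveat when searching by attribute: `find` returns `None` when nothing matches, so chaining `.text` or `['id']` onto a failed search raises an error. Below is a small defensive sketch, assuming `bs4` is installed. The helper name `text_or_default` and the inline HTML are invented for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="topic" id="api">First topic.</p><p>No attributes here.</p>'
soup = BeautifulSoup(html, 'html.parser')

def text_or_default(soup, name, attrs, default=''):
    """Return the text of the first matching Tag, or a default if none matches."""
    tag = soup.find(name=name, attrs=attrs)
    return tag.text if tag is not None else default

found = text_or_default(soup, 'p', {'id': 'api'})
missing = text_or_default(soup, 'p', {'id': 'does-not-exist'}, default='(not found)')

# Tag.get returns None for a missing attribute instead of raising KeyError
second_p = soup.find_all(name='p')[1]
second_id = second_p.get('id')
```

This matters most on real pages, where an element that exists on one page may be absent on the next.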

In [12]:
# limit search to specific sections
b.find_all(name='li')
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

[<li>Web scraping resource 1</li>, <li>Web scraping resource 2</li>]

## EXERCISE ONE

In [13]:
# find the 'h2' tag and then print its text
b.find(name='h2').text

'Resource List'

In [14]:
# find the 'p' tag with an 'id' value of 'api' and then print its text
b.find(name='p', attrs={'id':'api'}).text

'First, we are covering APIs, which are useful for getting data.'

In [15]:
# find the first 'p' tag and then print the value of the 'id' attribute
b.find(name='p')['id']

'api'

In [16]:
# print the text of all four resources
results = b.find_all(name='li')
for tag in results:
    print(tag.text)

API resource 1
API resource 2
Web scraping resource 1
Web scraping resource 2


In [17]:
# the same result, built as a list with a list comprehension
[tag.text for tag in b.find_all(name='li')]

['API resource 1',
 'API resource 2',
 'Web scraping resource 1',
 'Web scraping resource 2']

In [18]:
# print the text of only the API resources
results = b.find(name='ul', attrs={'id':'api'}).find_all(name='li')
for tag in results:
    print(tag.text)

API resource 1
API resource 2


## Scraping the IMDb website

In [19]:
# get the HTML from the Shawshank Redemption page
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')

In [20]:
r

<Response [200]>

In [21]:
# convert HTML into Soup
b = BeautifulSoup(r.text, 'lxml')
print(b)

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///title/tt0111161?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>The Shawshank Redemption (1994) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https

In [23]:
# Python 2 only: run this code if you have encoding errors
# (Python 3 strings are Unicode by default and sys.setdefaultencoding no longer exists)
import sys
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')

In [24]:
# get the title (slice off the 8 trailing characters of the heading, which follow the title itself)
b.find(name='h1').text[:-8]

'The Shawshank Redemption'
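
Slicing off a fixed number of characters only works while the suffix keeps the same length. A pattern-based alternative is sketched below. It assumes the heading ends with a four-digit year in parentheses. The raw string is a guess at the heading text rather than a value copied from the page, and `strip_year_suffix` is a hypothetical helper.

```python
import re

def strip_year_suffix(heading_text):
    """Remove a trailing '(YYYY)' and surrounding whitespace from a heading."""
    return re.sub(r'\s*\(\d{4}\)\s*$', '', heading_text).strip()

# Assumed raw heading text; the real page's text may differ
raw = 'The Shawshank Redemption\xa0(1994) '
title = strip_year_suffix(raw)
print(title)
```

A heading without a year suffix passes through unchanged, so the helper does not silently truncate other titles.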

In [25]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)

9.3

## EXERCISE TWO

In [26]:
# get the description
b.find(name='div', attrs={'class':'summary_text'}).text.strip()

'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'

In [27]:
# get the content rating (take the first whitespace-separated token, so multi-character ratings such as 'PG-13' also work)
b.find(name='div', attrs={'class':'subtext'}).text.split()[0]

'R'

In [28]:
# get the duration in minutes (first step: split the subtext into tokens)
b.find(name='div', attrs={'class':'subtext'}).text.split()

['R', '|', '2h', '22min', '|', 'Drama', '|', '14', 'October', '1994', '(USA)']
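
The token list above still has to be converted into a number of minutes. One way to finish the step is sketched here, assuming the duration always appears as tokens of the form `<n>h` and `<n>min`. The function name `duration_in_minutes` is hypothetical, and the token list is copied from the output above.

```python
import re

def duration_in_minutes(tokens):
    """Sum every '<n>h' and '<n>min' token into a total number of minutes."""
    total = 0
    for token in tokens:
        hours = re.fullmatch(r'(\d+)h', token)
        minutes = re.fullmatch(r'(\d+)min', token)
        if hours:
            total += 60 * int(hours.group(1))
        elif minutes:
            total += int(minutes.group(1))
    return total

tokens = ['R', '|', '2h', '22min', '|', 'Drama', '|', '14', 'October', '1994', '(USA)']
duration = duration_in_minutes(tokens)
print(duration)  # 142
```

Matching on the token pattern rather than on fixed positions keeps the function working for films with no hour component or no leftover minutes.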