# Web Scraping with BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
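
The robots.txt check can also be done in code. This is a minimal sketch using the standard library's `urllib.robotparser` (the Python 3 import path; under Python 2 the module is called `robotparser`). The rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are invented for illustration and are not IMDb's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real site's robots.txt will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules allow that URL
allowed_public = parser.can_fetch('*', 'http://example.com/title/tt0111161/')
allowed_private = parser.can_fetch('*', 'http://example.com/private/data')

print(allowed_public)
print(allowed_private)
```

For a live site, `set_url()` followed by `read()` would download the file instead of parsing a string.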

What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
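
To make "opened and closed" and "optional attributes" concrete, here is a small sketch that uses only the standard library's `html.parser` (Python 3; the Python 2 module is `HTMLParser`). The `TagLogger` class name is made up for this example, and the snippet is a shortened line from example.html.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every opening and closing tag the parser encounters."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; empty if no attributes
        self.events.append(('open', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('close', tag, {}))

snippet = "<p class='topic' id='api'>First, we are covering APIs.</p>"
logger = TagLogger()
logger.feed(snippet)

for event in logger.events:
    print(event)
```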

How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's inspect example.html

### Acquire Your Data

In [1]:
# read the HTML code for a web page into memory as a single string

with open('../data/example.html', 'rU') as f:
    html = f.read()

### Look at / Explore Your Data (html)

In [5]:
html

"<!DOCTYPE html>\n<html lang='en'>\n\n<head>\n    <title>Example Web Page</title>\n</head>\n\n<body>\n\n    <h1 id='main'>DAT10 Class 6</h1>\n\n    <p class='topic' id='api'>First, we are covering APIs, which are useful for getting data.</p>\n    <p class='topic' id='scraping'>Then, we are covering web scraping, which is a more flexible way to get data.</p>\n    <p class='topic' id='feedback'>Finally, I will ask you to fill out yet another feedback form!</p>\n\n    <h2>Resource List</h2>\n\n    <p>Here are some helpful API resources:</p>\n\n    <ul id='api'>\n        <li>API resource 1</li>\n        <li>API resource 2</li>\n    </ul>\n\n    <p>Here are some helpful web scraping resources:</p>\n\n    <ul id='scraping'>\n        <li>Web scraping resource 1</li>\n        <li>Web scraping resource 2</li>\n    </ul>\n\n</body>\n\n</html>\n"

In [2]:
print html

<!DOCTYPE html>
<html lang='en'>

<head>
    <title>Example Web Page</title>
</head>

<body>

    <h1 id='main'>DAT10 Class 6</h1>

    <p class='topic' id='api'>First, we are covering APIs, which are useful for getting data.</p>
    <p class='topic' id='scraping'>Then, we are covering web scraping, which is a more flexible way to get data.</p>
    <p class='topic' id='feedback'>Finally, I will ask you to fill out yet another feedback form!</p>

    <h2>Resource List</h2>

    <p>Here are some helpful API resources:</p>

    <ul id='api'>
        <li>API resource 1</li>
        <li>API resource 2</li>
    </ul>

    <p>Here are some helpful web scraping resources:</p>

    <ul id='scraping'>
        <li>Web scraping resource 1</li>
        <li>Web scraping resource 2</li>
    </ul>

</body>

</html>



In [3]:
# Beautiful Soup parses HTML and provides tools for searching it and extracting pieces
# convert HTML into a structured Soup object

from bs4 import BeautifulSoup
b = BeautifulSoup(html, 'html.parser')

In [4]:
# print out the object
print b

<!DOCTYPE html>

<html lang="en">
<head>
<title>Example Web Page</title>
</head>
<body>
<h1 id="main">DAT10 Class 6</h1>
<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>
<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>
<p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>
<h2>Resource List</h2>
<p>Here are some helpful API resources:</p>
<ul id="api">
<li>API resource 1</li>
<li>API resource 2</li>
</ul>
<p>Here are some helpful web scraping resources:</p>
<ul id="scraping">
<li>Web scraping resource 1</li>
<li>Web scraping resource 2</li>
</ul>
</body>
</html>



In [8]:
print b.prettify()

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Example Web Page
  </title>
 </head>
 <body>
  <h1 id="main">
   DAT10 Class 6
  </h1>
  <p class="topic" id="api">
   First, we are covering APIs, which are useful for getting data.
  </p>
  <p class="topic" id="scraping">
   Then, we are covering web scraping, which is a more flexible way to get data.
  </p>
  <p class="topic" id="feedback">
   Finally, I will ask you to fill out yet another feedback form!
  </p>
  <h2>
   Resource List
  </h2>
  <p>
   Here are some helpful API resources:
  </p>
  <ul id="api">
   <li>
    API resource 1
   </li>
   <li>
    API resource 2
   </li>
  </ul>
  <p>
   Here are some helpful web scraping resources:
  </p>
  <ul id="scraping">
   <li>
    Web scraping resource 1
   </li>
   <li>
    Web scraping resource 2
   </li>
  </ul>
 </body>
</html>



In [10]:
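# type `b.` and press Tab to list the available attributes and methods
# note: without parentheses, get_text is the bound method itself; call b.get_text() to get the text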
b.get_text

<bound method BeautifulSoup.get_text of <!DOCTYPE html>\n\n<html lang="en">\n<head>\n<title>Example Web Page</title>\n</head>\n<body>\n<h1 id="main">DAT10 Class 6</h1>\n<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>\n<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>\n<p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>\n<h2>Resource List</h2>\n<p>Here are some helpful API resources:</p>\n<ul id="api">\n<li>API resource 1</li>\n<li>API resource 2</li>\n</ul>\n<p>Here are some helpful web scraping resources:</p>\n<ul id="scraping">\n<li>Web scraping resource 1</li>\n<li>Web scraping resource 2</li>\n</ul>\n</body>\n</html>\n>

#### 'find' method returns the first matching Tag (and everything inside of it)

In [5]:
#For pulling out tags from html
b.find(name='body')

<body>\n<h1 id="main">DAT10 Class 6</h1>\n<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>\n<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>\n<p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>\n<h2>Resource List</h2>\n<p>Here are some helpful API resources:</p>\n<ul id="api">\n<li>API resource 1</li>\n<li>API resource 2</li>\n</ul>\n<p>Here are some helpful web scraping resources:</p>\n<ul id="scraping">\n<li>Web scraping resource 1</li>\n<li>Web scraping resource 2</li>\n</ul>\n</body>

In [6]:
b.find(name='h1')

<h1 id="main">DAT10 Class 6</h1>

In [9]:
# store the h1 Tag so we can strip the tags and get the inside text
title=b.find(name='h1')
title

<h1 id="main">DAT10 Class 6</h1>

In [10]:
# .contents returns the Tag's children as a list
title.contents

[u'DAT10 Class 6']

In [11]:
# .text returns the inside text as a string
title.text

u'DAT10 Class 6'

In [12]:
# Tags allow you to access the 'inside text'
b.find(name='h1').text

u'DAT10 Class 6'

In [14]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']

u'main'

In [13]:
# Q: how can we tell which tags have a given attribute? Indexing a missing attribute raises KeyError
b.find(name='body')['id']

KeyError: 'id'
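
One way to avoid the `KeyError` above is to look before indexing. This is a small sketch on a cut-down copy of the example page (the variable name `soup` is used here so that it does not overwrite `b`); `.attrs`, `.get()`, and `.has_attr()` are all part of Beautiful Soup's `Tag` interface.

```python
from bs4 import BeautifulSoup

html = "<body><h1 id='main'>DAT10 Class 6</h1></body>"
soup = BeautifulSoup(html, 'html.parser')

h1 = soup.find(name='h1')
body = soup.find(name='body')

# .attrs is a plain dictionary of the tag's attributes (empty if there are none)
print(h1.attrs)
print(body.attrs)

# .get() returns None (or a default you supply) instead of raising KeyError
print(h1.get('id'))
print(body.get('id'))
print(body.get('id', 'no id'))

# has_attr() answers the yes/no question directly
print(h1.has_attr('id'))
print(body.has_attr('id'))
```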

#### 'find_all' method is useful for finding all matching Tags

In [16]:
b.find_all(name='p')    # returns a ResultSet (like a list of Tags)

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>,
 <p>Here are some helpful API resources:</p>,
 <p>Here are some helpful web scraping resources:</p>]

### Quiz: What is the datatype returned by 'find_all'? What kinds of operations can we do on that datatype?

In [15]:
#A result set is a list of tags
type(b.find_all(name='p'))

bs4.element.ResultSet

In [31]:
type(b.find_all(name='h1')[0])

bs4.element.Tag

In [22]:
#it's a tag
type(b.find_all(name='p')[0])

bs4.element.Tag

In [19]:
# index a ResultSet to get the nth p tag (each element is a Tag)
b.find_all(name='p')[0]

<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>

In [20]:
b.find_all(name='p')[1]

<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>

In [21]:
b.find_all(name='p')[2]

<p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>

In [21]:
# store one Tag, then type `tag.` and press Tab to explore its attributes
tag=b.find_all(name='p')[0]

Hint: ResultSets can be sliced
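
As a sketch of what that hint means in practice: a ResultSet supports the same slicing and indexing as a list. The HTML below is a shortened stand-in for the five paragraphs in example.html, and `soup` and `paragraphs` are names chosen for this example.

```python
from bs4 import BeautifulSoup

html = """
<p class='topic' id='api'>First</p>
<p class='topic' id='scraping'>Then</p>
<p class='topic' id='feedback'>Finally</p>
<p>API resources</p>
<p>Scraping resources</p>
"""
soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find_all(name='p')

first_three = paragraphs[:3]    # the three 'topic' paragraphs
last_one = paragraphs[-1]       # negative indexes count from the end
every_other = paragraphs[::2]   # a step works too

# only the first three paragraphs have an id, so this indexing is safe
ids = [tag['id'] for tag in first_three]
print(ids)
print(last_one.text)
```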

In [25]:
len(b.find_all(name='p'))

5

In [24]:
b.find_all(name='p')[0]

<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>

In [26]:
b.find_all(name='p')[0].text

u'First, we are covering APIs, which are useful for getting data.'

In [27]:
b.find_all(name='p')[0]['id']

u'api'

In [22]:
# iterate over a ResultSet
results = b.find_all(name='p')
for tag in results:
    print tag.text

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Here are some helpful API resources:
Here are some helpful web scraping resources:


### Quiz: How would you write the above as a list comprehension?

### Part II: Make a string with each tag.text separated by a newline character '\n'

In [33]:
results = b.find_all(name='p')
[tag.text for tag in b.find_all(name='p')]
[tag.text for tag in results]
[tag.text for tag in results if len(tag.text)>0]

[u'First, we are covering APIs, which are useful for getting data.',
 u'Then, we are covering web scraping, which is a more flexible way to get data.',
 u'Finally, I will ask you to fill out yet another feedback form!',
 u'Here are some helpful API resources:',
 u'Here are some helpful web scraping resources:']

In [36]:
bstring=''

for tag in results:
    bstring += tag.text + '\n'
    
print bstring

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Here are some helpful API resources:
Here are some helpful web scraping resources:
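
An alternative to building the string with `+=` in a loop is `str.join`. In this sketch `texts` is a stand-in for the list produced by the comprehension `[tag.text for tag in results]`, shortened to three entries. One difference from the loop: `join` puts the separator only between items, so there is no trailing newline.

```python
# stand-in for [tag.text for tag in results]
texts = [
    'First, we are covering APIs, which are useful for getting data.',
    'Then, we are covering web scraping, which is a more flexible way to get data.',
    'Finally, I will ask you to fill out yet another feedback form!',
]

# join inserts '\n' between the items, not after the last one
joined = '\n'.join(texts)
print(joined)
```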



### Limit search by Tag attribute

In [37]:
# attrs takes a dictionary of attribute names and the values to match

b.find(name='p', attrs={'id':'scraping'})

<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>

In [25]:
b.find_all(name='p', attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>]

In [24]:
b.find_all(attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>]

### Limit search to specific sections

In [26]:
b.find_all(name='li')

[<li>API resource 1</li>,
 <li>API resource 2</li>,
 <li>Web scraping resource 1</li>,
 <li>Web scraping resource 2</li>]

In [27]:
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

[<li>Web scraping resource 1</li>, <li>Web scraping resource 2</li>]

## In Class Exercise

1) Find the 'h2' tag and then print its text

In [28]:
h=b.find(name='h2')

In [29]:
h

<h2>Resource List</h2>

In [52]:
h.text

u'Resource List'

2) Find the 'p' tag with an 'id' value of 'feedback' and then print its text


In [55]:
feed= b.find(name='p', attrs={'id':'feedback'})

In [56]:
print feed.text

Finally, I will ask you to fill out yet another feedback form!


3) Find the first 'p' tag and then print the value of the 'id' attribute


In [30]:
# 0 denotes first p tag
print b.find_all(name='p')[0]['id']

api


4) Print the text of all four resources

In [62]:
[tag.text for tag in b.find_all(name='li')]

[u'API resource 1',
 u'API resource 2',
 u'Web scraping resource 1',
 u'Web scraping resource 2']

In [59]:
# earlier attempt: searching for 'p' returns the paragraphs, not the four resources
[tag.text for tag in b.find_all(name='p')]

[u'First, we are covering APIs, which are useful for getting data.',
 u'Then, we are covering web scraping, which is a more flexible way to get data.',
 u'Finally, I will ask you to fill out yet another feedback form!',
 u'Here are some helpful API resources:',
 u'Here are some helpful web scraping resources:']

5) Using a list comprehension can you extract the text of only the API resources?

In [67]:
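# first attempt (wrong): matches every tag with id='api', i.e. the whole <p> and <ul>, not the individual resources
[tag.text for tag in b.find_all(attrs={'id':'api'})]

# correct: keep only the <li> tags whose text mentions "API"
[tag.text for tag in b.find_all(name='li') if "API" in tag.text]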

[u'API resource 1', u'API resource 2']

### Tool: Selector Gadget
http://selectorgadget.com/
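
SelectorGadget produces CSS selectors, and Beautiful Soup can search with those directly through its `select` method. This is a minimal sketch on a cut-down copy of example.html; the selectors were written by hand for this example, not generated by the tool.

```python
from bs4 import BeautifulSoup

html = """
<p class='topic' id='scraping'>Then, we are covering web scraping.</p>
<ul id='api'><li>API resource 1</li><li>API resource 2</li></ul>
<ul id='scraping'><li>Web scraping resource 1</li><li>Web scraping resource 2</li></ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'ul#api li' means: every <li> inside the <ul> whose id is 'api'
api_items = [tag.text for tag in soup.select('ul#api li')]

# 'p.topic' means: every <p> whose class is 'topic'
topics = [tag.text for tag in soup.select('p.topic')]

print(api_items)
print(topics)
```

The first selector does the same job as chaining `find(name='ul', attrs={'id':'api'})` with `find_all(name='li')`.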

## Scraping IMDb

#### First open your browser and look at the website and the html structure

http://www.imdb.com/title/tt0111161/

#### Get the HTML from the Shawshank Redemption page

In [52]:
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')

#### What is r? What can we do with it?

#### convert HTML into Soup

In [56]:
b = BeautifulSoup(r.text, 'html.parser')
print b

In [55]:
# Python 2 only: run this code if you have encoding errors when printing the soup
import sys
reload(sys)
sys.setdefaultencoding('utf8')

#### Get the title

In [57]:
b.find('h1').text

u'Sisters\n                   (2015)\n                   \n'

#### Get the Star Rating (as a float)

In [72]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)

9.3

#### Get the Movie Rating

In [37]:
# get the movie (content) rating
panel = b.find('meta', attrs={'itemprop':'contentRating'}) # .text returns too much: rating, duration, genre, and release date

panel.text

u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '
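
The text above bundles several fields, separated by `|` and padded with whitespace. One possible way to pull them apart, sketched here on a copy of that exact string (the names `raw` and `fields` are made up for this example):

```python
# the string returned by panel.text above
raw = u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '

# split on the '|' separators, then collapse runs of whitespace inside each piece
fields = [' '.join(piece.split()) for piece in raw.split('|')]
content_rating, duration, genres, release = fields

print(fields)
```

This relies on the page always listing the four fields in the same order, which is an assumption that may not hold for every title.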

### In-Class Exercise

Intro Level: 
Using the Omdbapi, c
    

Challenge Level:
Can you scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year, and the unique movie identifier (e.g. 'tt0111161')?
Use the function above to scrape each of the movie pages.
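
For the unique movie identifier, a regular expression is one option once you have each movie's link. This sketch assumes the links contain the identifier in the same `tt` plus digits form as the title URLs used in this notebook; `extract_imdb_id` and the relative link are hypothetical.

```python
import re

def extract_imdb_id(href):
    """Return the 'tt' identifier found in a link, or None if there is none."""
    match = re.search(r'tt\d+', href)
    return match.group(0) if match else None

print(extract_imdb_id('http://www.imdb.com/title/tt0111161/'))
print(extract_imdb_id('/title/tt0068646/?ref_=chart'))  # hypothetical relative link
print(extract_imdb_id('/chart/top'))
```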


**Questions:**

How many of the Top movies are rated 'R'?

What is the average duration of movies with a star_rating above 9?

What is the average duration of movies before 1985 and after?

In [40]:
# Get the year for the first 75 of the 1000 IMDb movies
# Q: how to save, export data set

# read IMDb data into a DataFrame: we want a year column!
import pandas as pd
movie_all = pd.read_csv('../data/imdb_1000.csv')
movies = movie_all.head(75).copy()    # copy so that adding a column does not raise SettingWithCopyWarning

# define a function to return the year
def get_movie_year(title):
    # pass the title via params so requests URL-encodes spaces and special characters
    params = {'t': title, 'r': 'json', 'type': 'movie'}
    r = requests.get('http://www.omdbapi.com/', params=params)
    info = r.json()
    if info['Response'] == 'True':
        return int(info['Year'])
    else:
        return None

# write a for loop to build a list of years
# sleep(1) pauses for one second between requests so we do not overload the API
from time import sleep
years = []
for title in movies.title:
    years.append(get_movie_year(title))
    sleep(1)

movies['year'] = years

movies.head()

| | star_rating | title | content_rating | genre | duration | actors_list | year |
|---|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... | 1994 |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] | 1972 |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... | 1974 |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... | 2008 |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... | 1994 |
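
On the question of how to save or export the data set: pandas DataFrames have a `to_csv` method. This is a minimal sketch using a two-row stand-in for `movies`; the name `movies_sample` and the output file name are made up.

```python
import pandas as pd

# a two-row stand-in for the `movies` DataFrame built above
movies_sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather'],
    'year': [1994, 1972],
})

# with no path argument, to_csv returns the CSV text instead of writing a file
csv_text = movies_sample.to_csv(index=False)
print(csv_text)

# passing a path writes the file; 'movies_with_year.csv' is a hypothetical name
# movies_sample.to_csv('movies_with_year.csv', index=False)
```

Saving the result means the one-request-per-second loop does not have to be repeated in a later session.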


In [None]:
#How many of the top movies are rated R?

In [83]:
movies.content_rating.value_counts()


R            34
PG-13        14
PG            8
APPROVED      7
NOT RATED     6
G             3
UNRATED       2
PASSED        1
Name: content_rating, dtype: int64

In [None]:
#What is the average duration of movies with a star rating above 9?

In [96]:
movies.star_rating.value_counts()


8.5    23
8.6    15
8.4    12
8.7    10
8.9     6
8.8     5
9.3     1
9.1     1
9.2     1
9.0     1
Name: star_rating, dtype: int64

In [97]:
movies[movies.star_rating > 9].mean() 

star_rating       9.200000
duration        172.333333
year           1980.000000
dtype: float64

In [None]:
#What is the average duration of movies before 1985 and after?

In [99]:
movies[movies.year > 1985].mean() 

star_rating       8.625000
duration        141.000000
year           2000.636364
dtype: float64

In [98]:
movies[movies.year <= 1985].mean() 

star_rating       8.625000
duration        128.714286
year           1962.321429
dtype: float64

### Optional Web Scraping Homework

First, define a function that accepts an IMDb ID and returns a dictionary of
movie information: title, star_rating, description, content_rating, duration.
The function should gather this information by scraping the IMDb website, not
by calling the OMDb API. (This is really just a wrapper of the web scraping
code we wrote above.)

For example, `get_movie_info('tt0111161')` should return:
```
{'content_rating': 'R',
 'description': u'Two imprisoned men bond over a number of years...',
 'duration': 142,
 'star_rating': 9.3,
 'title': u'The Shawshank Redemption'}
 ```

Then, open the file imdb_ids.txt using Python, and write a for loop that builds
a list in which each element is a dictionary of movie information.
Finally, convert that list into a DataFrame.




In [177]:
#open file
import pandas as pd
movie_ids = pd.read_csv('../data/imdb_ids.txt', delim_whitespace=True, header=None, names=['id'])
movie_ids

| | id |
|---|---|
| 0 | tt0111161 |
| 1 | tt1856010 |
| 2 | tt0096694 |
| 3 | tt0088763 |
| 4 | tt1375666 |


In [179]:
#title, star_rating, description, content_rating, duration

# define a function to return a dictionary of movie information
def get_movie_info(movieid):
    r = requests.get('http://www.imdb.com/title/' + movieid)
    b = BeautifulSoup(r.text, 'html.parser')
    
    info = {}
    
    info['id']=movieid
    
    #title
    title=b.find_all('h1')[0].text.split("\n")[0]
    info['title']=title
    
    #rating value
    star=b.find_all(attrs={'itemprop':'ratingValue'})[0].text
    info['star']=star
    
    #description
    des=b.find_all(attrs={'itemprop':'description'})[0].text.split("\n")[1]
    des=des.lstrip()
    info['des']=des
    
    #content rating
    content=b.find_all('meta', attrs={'itemprop':'contentRating'})[0].text.split("\n")[0]
    info['content']=content
    
    #duration
    dur=b.find_all('time', attrs={'itemprop':'duration'})[0].text.split("\n")[1]
    dur=dur.lstrip()
    info['dur']=dur
    
    return info

In [182]:
allinfo=[]
for movieid in movie_ids.id :
    allinfo.append(get_movie_info(movieid))
all2=pd.DataFrame(allinfo)
all2    



| | content | des | dur | id | star | title |
|---|---|---|---|---|---|---|
| 0 | R | Two imprisoned men bond over a number of years... | 2h 22min | tt0111161 | 9.3 | The Shawshank Redemption |
| 1 | TV-MA | A Congressman works with his equally conniving... | 51min | tt1856010 | 9.0 | House of Cards |
| 2 | TV-PG | A TV show centered on six students and their y... | 30min | tt0096694 | 7.0 | Saved by the Bell |
| 3 | PG | A young man is accidentally sent thirty years ... | 1h 56min | tt0088763 | 8.5 | Back to the Future |
| 4 | PG-13 | A thief who steals corporate secrets through u... | 2h 28min | tt1375666 | 8.8 | Inception |
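
The homework example expects `duration` as a number of minutes (142), while the `dur` column holds strings such as '2h 22min'. One possible conversion is sketched below; `duration_to_minutes` is a hypothetical helper, and it assumes the only formats are the ones shown in the table.

```python
import re

def duration_to_minutes(text):
    """Convert a string such as '2h 22min' or '51min' to a number of minutes."""
    match = re.match(r'^\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*$', text)
    if match is None or not any(match.groups()):
        return None    # nothing recognizable as hours or minutes
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

for sample in ['2h 22min', '51min', '1h 56min']:
    print(duration_to_minutes(sample))
```

Inside `get_movie_info`, the result of this function could be stored in place of the raw string.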


In [54]:
r = requests.get('http://www.imdb.com/title/tt1850457/')
b = BeautifulSoup(r.text, 'html.parser')

In [150]:
#title
#b.find_all('h1')[0].text
title=b.find_all('h1')[0].text.split("\n")[0]
#title=b.find_all('h1')[0].text.encode("utf-8").split()[0]
title

u'Sisters'

In [107]:
# rating value: check the type of the extracted text
type(b.find_all(attrs={'itemprop':'ratingValue'})[0].text)

unicode

In [151]:
#description
d=b.find_all(attrs={'itemprop':'description'})[0].text.split("\n")[1]
d = d.lstrip()
d

u'Two sisters decide to throw one last house party before their parents sell their family home.'

In [161]:
#content rating
b.find_all('meta', attrs={'itemprop':'contentRating'})[0].text.split("\n")[0]

u'R'

In [165]:
info = {}


#duration
dur=b.find_all('time', attrs={'itemprop':'duration'})[0].text.split("\n")[1]
dur = dur.lstrip()
type(dur)
info['dur']=dur
info

{'dur': u'1h 58min'}