# Web Scraping with BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
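
The robots.txt check can also be done in code rather than by eye. The sketch below uses Python 3's standard-library `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are made up for illustration and are not IMDb's actual file, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only (NOT the contents of IMDb's robots.txt).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network access is needed here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("http://www.example.com/title/tt0111161/"))
print(is_allowed("http://www.example.com/private/page.html"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a string.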

What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
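
To make "opened and closed" and "optional attributes" concrete, here is a small sketch using Python 3's standard-library `html.parser` (not BeautifulSoup). `TagLogger` is a hypothetical class name, and the one-line HTML snippet is adapted from example.html.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening and closing tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))


logger = TagLogger()
logger.feed("<p class='topic' id='api'>First, we are covering APIs.</p>")

for event in logger.events:
    print(event)
```

The opening `p` tag carries two attributes (`class` and `id`); the closing tag carries none.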

How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's inspect example.html

### Acquire Your Data

In [1]:
# read the HTML code for a web page and save as a string
with open('../DAT-DC-10/data/example.html', 'rU') as f:
    html = f.read()

### Look at / Explore Your Data (html)

In [2]:
print html

<!DOCTYPE html>
<html lang='en'>

<head>
    <title>Example Web Page</title>
</head>

<body>

    <h1 id='main'>DAT10 Class 6</h1>

    <p class='topic' id='api'>First, we are covering APIs, which are useful for getting data.</p>
    <p class='topic' id='scraping'>Then, we are covering web scraping, which is a more flexible way to get data.</p>
    <p class='topic' id='feedback'>Finally, I will ask you to fill out yet another feedback form!</p>

    <h2>Resource List</h2>

    <p>Here are some helpful API resources:</p>

    <ul id='api'>
        <li>API resource 1</li>
        <li>API resource 2</li>
    </ul>

    <p>Here are some helpful web scraping resources:</p>

    <ul id='scraping'>
        <li>Web scraping resource 1</li>
        <li>Web scraping resource 2</li>
    </ul>

</body>

</html>



In [3]:
# convert HTML into a structured Soup object
from bs4 import BeautifulSoup
b = BeautifulSoup(html, 'html.parser')

In [4]:
# print out the object
# print b
# print b.prettify()

#### 'find' method returns the first matching Tag (and everything inside of it)

In [5]:
# b.find(name='body')
b.find(name='h1')


<h1 id="main">DAT10 Class 6</h1>

In [6]:
# Tags allow you to access the 'inside text'
b.find(name='h1').text

u'DAT10 Class 6'

In [7]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']

u'main'

#### 'find_all' method is useful for finding all matching Tags

In [8]:
b.find_all(name='p')    # returns a ResultSet (like a list of Tags)

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>,
 <p>Here are some helpful API resources:</p>,
 <p>Here are some helpful web scraping resources:</p>]

### Quiz: What is the datatype returned by 'find_all'? What kinds of operations can we do on that datatype?

Hint: ResultSets can be sliced

In [9]:
# len(b.find_all(name='p'))
# b.find_all(name='p')[0]
# b.find_all(name='p')[0].text
# b.find_all(name='p')[0]['id']
# b.find_all(name='body')

In [10]:
# iterate over a ResultSet
results = b.find_all(name='body')
for tag in results:
    print tag.text


DAT10 Class 6
First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Resource List
Here are some helpful API resources:

API resource 1
API resource 2

Here are some helpful web scraping resources:

Web scraping resource 1
Web scraping resource 2




In [11]:
# Make a string with each tag.text separated by a newline character '\n'

'\n'.join(tag.text for tag in b.find_all(name='p'))

u'First, we are covering APIs, which are useful for getting data.\nThen, we are covering web scraping, which is a more flexible way to get data.\nFinally, I will ask you to fill out yet another feedback form!\nHere are some helpful API resources:\nHere are some helpful web scraping resources:'

### Quiz: How would you write the above as a list comprehension?

### Limit search by Tag attribute

In [12]:
b.find(name='p', attrs={'id':'scraping'})

<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>

In [13]:
b.find_all(name='p', attrs={'class':'topic'})
# b.find_all(attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>]

### Limit search to specific sections

In [14]:
b.find_all(name='li')
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

[<li>Web scraping resource 1</li>, <li>Web scraping resource 2</li>]

## In Class Exercise

1) Find the 'h2' tag and then print its text

In [15]:
b.find(name='h2').text

u'Resource List'

2) Find the 'p' tag with an 'id' value of 'feedback' and then print its text


In [16]:
b.find(name='p', attrs={'id':'feedback'}).text

u'Finally, I will ask you to fill out yet another feedback form!'

3) Find the first 'p' tag and then print the value of the 'id' attribute


In [17]:
b.find(name='p')['id']

u'api'

4) Print the text of all four resources

In [18]:
[tag.text for tag in b.find_all(name='li')]

[u'API resource 1',
 u'API resource 2',
 u'Web scraping resource 1',
 u'Web scraping resource 2']

5) Using a list comprehension can you extract the text of only the API resources?

In [19]:
[tag.text for tag in b.find_all(name='li') if 'API' in tag.text]

[u'API resource 1', u'API resource 2']

### Tool: Selector Gadget
http://selectorgadget.com/

## Scraping IMDb

#### First, open your browser and look at the website and its HTML structure

http://www.imdb.com/title/tt0111161/

#### Get the HTML from the Shawshank Redemption page

In [20]:
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')

#### What is r? What can we do with it?

In [21]:
r

<Response [200]>

#### convert HTML into Soup

In [22]:
b = BeautifulSoup(r.text, 'html.parser')
# print b

In [23]:
# run this code if you have encoding errors (Python 2 only: reload and sys.setdefaultencoding do not exist in Python 3)
import sys
reload(sys)
sys.setdefaultencoding('utf8')

#### Get the title

In [24]:
b.find('h1').text

u'The Shawshank Redemption\n                   (1994)\n                   \n'
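
The raw heading text carries the year and a lot of whitespace. A minimal sketch for tidying it, assuming the heading always looks like "Title (YYYY)" once whitespace is collapsed; `split_title_and_year` is a hypothetical helper, written for Python 3.

```python
import re

# The text returned by b.find('h1').text in the cell above.
raw_title = 'The Shawshank Redemption\n                   (1994)\n                   \n'


def split_title_and_year(raw):
    """Collapse whitespace, then split 'Title (YYYY)' into (title, year)."""
    cleaned = ' '.join(raw.split())          # any run of whitespace -> one space
    match = re.match(r'^(.*) \((\d{4})\)$', cleaned)
    if match is None:                        # no trailing '(YYYY)' found
        return cleaned, None
    return match.group(1), int(match.group(2))


title, year = split_title_and_year(raw_title)
print(title)
print(year)
```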

#### Get the Star Rating (as a float)

In [25]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)


9.3

#### Get the Movie Rating

In [26]:
panel = b.find('meta', attrs={'itemprop':'contentRating'}) # .text returns too much here: the whole info bar, not just the rating
panel.text

u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '

In [27]:
# Using the OMDb API, request the release year of each movie in the CSV (only the first 50 here). Answer the questions below.
import pandas as pd
movies = pd.read_csv('../DAT-DC-10/data/imdb_1000.csv')
top_50 = movies.head(50).copy()

import requests

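def get_movie_year(title):
    r = requests.get('http://www.omdbapi.com/?t=' + title + '&r=json&type=movie')
    info = r.json()
    if info['Response'] == 'True':
        try:
            return int(info['Year'])
        except (KeyError, ValueError):
            return None    # 'Year' is missing or is not a plain integer
    return None            # the API found no match for this title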

from time import sleep
years = []
for title in top_50.title:
    years.append(get_movie_year(title))
    sleep(3)
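
Titles are pasted straight into the query string above, which can go wrong for titles containing characters such as `&`. A hedged sketch of building the same URL with proper encoding, using Python 3's standard library; `build_omdb_url` is a hypothetical helper and 'Tom & Jerry' is a made-up test title.

```python
from urllib.parse import urlencode

OMDB_ENDPOINT = 'http://www.omdbapi.com/'


def build_omdb_url(title):
    """Return the OMDb query URL for a title, with the title URL-encoded."""
    query = urlencode({'t': title, 'r': 'json', 'type': 'movie'})
    return OMDB_ENDPOINT + '?' + query


print(build_omdb_url('The Dark Knight'))
print(build_omdb_url('Tom & Jerry'))   # made-up title containing '&'
```

With `requests`, passing the same dictionary as `params=` to `requests.get` achieves the same encoding.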

In [28]:
len(years)

50

In [29]:
top_50["Year"]=years
top_50.tail(5)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,Year
45,8.5,Terminator 2: Judgment Day,R,Action,137,"[u'Arnold Schwarzenegger', u'Linda Hamilton', ...",1991.0
46,8.5,Memento,R,Mystery,113,"[u'Guy Pearce', u'Carrie-Anne Moss', u'Joe Pan...",2000.0
47,8.5,Taare Zameen Par,PG,Drama,165,"[u'Darsheel Safary', u'Aamir Khan', u'Tanay Ch...",
48,8.5,Dr. Strangelove or: How I Learned to Stop Worr...,PG,Comedy,95,"[u'Peter Sellers', u'George C. Scott', u'Sterl...",1964.0
49,8.5,The Departed,R,Crime,151,"[u'Leonardo DiCaprio', u'Matt Damon', u'Jack N...",2006.0


In-Class Exercise

Intro Level:

Using the OMDb API, request the release year of each of the 1000 movies in the CSV. Answer the questions below.

Challenge Level:

Can you scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year and unique movie identifier (e.g. 'tt0111161')? Use the function above to scrape each of the movie pages.

Questions:
How many of the Top movies are rated 'R'?
What is the average duration of movies with a star_rating above 9?
What is the average duration of movies released before 1985, and of those released after?

In [61]:
# Scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6)
# and return a DataFrame with the movie name, rating, year and unique movie identifier
# (e.g. 'tt0111161'). Use the function above to scrape each of the movie pages.
import pandas as pd

r = requests.get('http://www.imdb.com/chart/top?ref_=nv_mv_250_6')
imdb250 = BeautifulSoup(r.text, 'html.parser')

# prepare the list from which to pull title and year
List = [item.text for item in imdb250.find_all(name='td', attrs={'class':'titleColumn'})]
newList = [item.split('\n') for item in List]

# title
titles = [item[2][6:] for item in newList]

# year
years = [int(item[3][1:5]) for item in newList]

# rating
ratingList = [item.text for item in imdb250.find_all(name='td', attrs={'class':'ratingColumn imdbRating'})]
newList2 = [item.split('\n') for item in ratingList]
ratings = [float(item[1]) for item in newList2]

# unique identifier
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/
# imdb250.find_all(name='div', attrs={'class':'wlb_ribbon'})

idList = [str(item.attrs['data-tconst']) for item in imdb250.find_all(name='div', attrs={'class':'wlb_ribbon'})]


imdb_top250 = pd.DataFrame()
imdb_top250['Title']=titles
imdb_top250['Year']=years
imdb_top250['Rating']=ratings
imdb_top250['UniqueID']=idList
imdb_top250.tail(5)


Unnamed: 0,Title,Year,Rating,UniqueID
245,Gangs of Wasseypur,2012,8,tt1954470
246,Three Colors: Red,1994,8,tt0111495
247,Chakde! India,2007,8,tt0871510
248,La Strada,1954,8,tt0047528
249,The Graduate,1967,8,tt0061722


In [116]:
# Questions:
# How many of the Top movies are rated 'R'? 
# What is the average duration of movies with a star_rating above 9? 
# What is the average duration of movies before 1985 and after?


# imdb250 = BeautifulSoup(r.text, 'html.parser')
# imdb250.find(name='meta', attrs={'itemprop':'contentRating'}).attrs['content']

def get_movie_rating(UniqueID):
    r = requests.get('http://www.imdb.com/title/' + UniqueID)
    movie = BeautifulSoup(r.text, 'html.parser')
    try:
        return movie.find(name='meta', attrs={'itemprop':'contentRating'}).text.split('\n')[0]
    except AttributeError:
        return 'No Rating'

from time import sleep

contentRating = []
for ID in idList:
    contentRating.append(get_movie_rating(ID))
    sleep(3)

imdb_top250['ContentRating']=contentRating
imdb_top250

# get_movie_rating(imdb_top250.UniqueID[20])


Unnamed: 0,Title,Year,Rating,UniqueID,ContentRating
0,The Shawshank Redemption,1994,9.2,tt0111161,R
1,The Godfather,1972,9.2,tt0068646,R
2,The Godfather: Part II,1974,9.0,tt0071562,R
3,The Dark Knight,2008,8.9,tt0468569,PG-13
4,12 Angry Men,1957,8.9,tt0050083,Not Rated
5,Schindler's List,1993,8.9,tt0108052,R
6,Pulp Fiction,1994,8.9,tt0110912,R
7,"The Good, the Bad and the Ugly",1966,8.9,tt0060196,Not Rated
8,The Lord of the Rings: The Return of the King,2003,8.9,tt0167260,PG-13
9,Fight Club,1999,8.8,tt0137523,R


In [117]:
imdb_top250.head()

Unnamed: 0,Title,Year,Rating,UniqueID,ContentRating
0,The Shawshank Redemption,1994,9.2,tt0111161,R
1,The Godfather,1972,9.2,tt0068646,R
2,The Godfather: Part II,1974,9.0,tt0071562,R
3,The Dark Knight,2008,8.9,tt0468569,PG-13
4,12 Angry Men,1957,8.9,tt0050083,Not Rated


In [122]:
# Questions:
# How many of the Top movies are rated 'R'? 

imdb_top250.groupby('ContentRating').count() # 104 movies in the Top 250 are rated R
# imdb_top250[imdb_top250.ContentRating == 'R'].count()


ContentRating,Title,Year,Rating,UniqueID
Approved,18,18,18,18
G,13,13,13,13
M,1,1,1,1
No Rating,2,2,2,2
Not Rated,32,32,32,32
PG,30,30,30,30
PG-13,34,34,34,34
Passed,3,3,3,3
R,104,104,104,104
TV-MA,1,1,1,1


In [123]:
# Questions:
# What is the average duration of movies with a star_rating above 9? 

def get_movie_duration(UniqueID):
    r = requests.get('http://www.imdb.com/title/' + str(UniqueID))
    movie = BeautifulSoup(r.text, 'html.parser')
    return int(movie.find(name='time', attrs={'itemprop':'duration'}).attrs['datetime'][2:-1])

from time import sleep
duration = []
for ID in imdb_top250.UniqueID:
    duration.append(get_movie_duration(ID))
    sleep(3)

imdb_top250['Duration']=duration
imdb_top250.head()


Unnamed: 0,Title,Year,Rating,UniqueID,ContentRating,Duration
0,The Shawshank Redemption,1994,9.2,tt0111161,R,142
1,The Godfather,1972,9.2,tt0068646,R,175
2,The Godfather: Part II,1974,9.0,tt0071562,R,202
3,The Dark Knight,2008,8.9,tt0468569,PG-13,152
4,12 Angry Men,1957,8.9,tt0050083,Not Rated,96
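
The `[2:-1]` slice in `get_movie_duration` only works if the `datetime` attribute looks like `PT142M`. A slightly more defensive sketch, assuming the value is an ISO 8601 duration made of hours and/or minutes; `duration_minutes` is a hypothetical helper, and the `PT2H22M` form is an assumption about what a page might contain, not something seen in this notebook.

```python
import re

# Matches 'PT142M', 'PT2H22M' or 'PT2H' (hours and minutes are both optional).
_DURATION = re.compile(r'^PT(?:(\d+)H)?(?:(\d+)M)?$')


def duration_minutes(value):
    """Convert an ISO 8601 duration such as 'PT142M' to whole minutes."""
    match = _DURATION.match(value)
    if match is None or not any(match.groups()):
        raise ValueError('unrecognised duration: %r' % value)
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes


print(duration_minutes('PT142M'))
print(duration_minutes('PT2H22M'))
```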


In [124]:
# average duration of movies with a star_rating above 9
imdb_top250[imdb_top250.Rating > 9].Duration.mean()


158.5

In [125]:
# Average duration of movies before 1985
imdb_top250[imdb_top250.Year < 1985].Duration.mean()

127.28571428571429

In [126]:
# Average duration of movies after 1985 (movies released in 1985 itself are in neither group)
imdb_top250[imdb_top250.Year > 1985].Duration.mean()

129.71323529411765
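
The two filters above use `< 1985` and `> 1985`, so a film from 1985 is counted in neither average. A plain-Python sketch of a split that covers every row, using only the five rows shown in the Duration table earlier, so the numbers it prints are not the notebook's answers. The `CUTOFF` name and the choice to put 1985 in the later group are assumptions.

```python
from statistics import mean

# Only the five rows displayed earlier (year, duration in minutes).
rows = [
    {'Title': 'The Shawshank Redemption', 'Year': 1994, 'Duration': 142},
    {'Title': 'The Godfather', 'Year': 1972, 'Duration': 175},
    {'Title': 'The Godfather: Part II', 'Year': 1974, 'Duration': 202},
    {'Title': 'The Dark Knight', 'Year': 2008, 'Duration': 152},
    {'Title': '12 Angry Men', 'Year': 1957, 'Duration': 96},
]

CUTOFF = 1985  # assumption: films from the cutoff year go in the later group

before = [row['Duration'] for row in rows if row['Year'] < CUTOFF]
from_cutoff = [row['Duration'] for row in rows if row['Year'] >= CUTOFF]

mean_before = mean(before)
mean_from_cutoff = mean(from_cutoff)
print(mean_before, mean_from_cutoff)
```

In pandas the equivalent change is to use `>=` in one of the two boolean filters.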

### Optional Web Scraping Homework

First, define a function that accepts an IMDb ID and returns a dictionary of
movie information: title, star_rating, description, content_rating, duration.
The function should gather this information by scraping the IMDb website, not
by calling the OMDb API. (This is really just a wrapper of the web scraping
code we wrote above.)

For example, `get_movie_info('tt0111161')` should return:
```
{'content_rating': 'R',
 'description': u'Two imprisoned men bond over a number of years...',
 'duration': 142,
 'star_rating': 9.3,
 'title': u'The Shawshank Redemption'}
 ```

Then, open the file imdb_ids.txt using Python, and write a for loop that builds
a list in which each element is a dictionary of movie information.
Finally, convert that list into a DataFrame.

### Bonus -- Challenge

Can you scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year and unique movie identifier (e.g. 'tt0111161')?

Use the function above to scrape each of the movie pages.

**Questions:**

How many of the Top movies are rated 'R'?

What is the average duration of movies with a star_rating above 9?

What is the average duration of movies before 1985 and after?



In [377]:
def get_movie_title(movieID):
    r = requests.get('http://www.imdb.com/title/' + movieID)
    movie = BeautifulSoup(r.text, 'html.parser')
    try:
        return movie.find(name='title').text.split(' (')[0]
    except AttributeError:
        return 'Na'
    
def get_movie_star_rating(movieID):
    r = requests.get('http://www.imdb.com/title/' + movieID)
    movie = BeautifulSoup(r.text, 'html.parser')
    try:
        return movie.find(name='span', attrs={'itemprop':'ratingValue'}).text
    except AttributeError:
        return 'Na'
    
# the function for description was very convoluted. any suggestions for how to streamline?
def get_movie_description(movieID):
    r = requests.get('http://www.imdb.com/title/' + movieID)
    movie = BeautifulSoup(r.text, 'html.parser')
    try:
        return movie.find(name='div', attrs={'id':'title-overview-widget'}).text.split('\n                    ')[3].split('\n')[0]
    except (AttributeError, IndexError):    # IndexError: split() produced fewer pieces than expected
        return 'Na'
    
# def get_movie_duration(movieID):
#     r = requests.get('http://www.imdb.com/title/' + movieID)
#     movie = BeautifulSoup(r.text, 'html.parser')
#     return movie.find(name='time', attrs={'itemprop':'duration'}).text # returns hours and minutes


def get_movie_duration(UniqueID):
    r = requests.get('http://www.imdb.com/title/' + str(UniqueID))
    movie = BeautifulSoup(r.text, 'html.parser')
    try:
        return int(movie.find(name='time', attrs={'itemprop':'duration'}).attrs['datetime'][2:-1])
    except AttributeError:
        return 'Na'
    
get_movie_duration('tt1392170')


142

In [380]:
def get_movie_info(movieID):
    info = {}
    info['title'] = get_movie_title(movieID)
    info['star_rating'] = get_movie_star_rating(movieID)
    info['description'] = get_movie_description(movieID)
    info['content_rating'] = get_movie_rating(movieID)
    info['duration'] = str(get_movie_duration(movieID))

    return info
    
get_movie_info('')

{'content_rating': 'No Rating',
 'description': 'Na',
 'duration': 'Na',
 'star_rating': 'Na',
 'title': u'404 Error'}

In [381]:
# open the file imdb_ids.txt using Python
with open('../DAT-DC-10/data/imdb_ids.txt', 'rU') as f:
    imdb_ids = f.read().split('\n')

In [382]:
imdb_ids

['tt0111161', 'tt1856010', 'tt0096694', 'tt0088763', 'tt1375666', '']
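
The trailing `''` in this list comes from the final newline in the file, and it is what later produces the '404 Error' row. A minimal sketch of reading the IDs while skipping blank lines; `read_ids` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real file so the example runs anywhere (Python 3).

```python
import io


def read_ids(handle):
    """Return the non-empty, stripped lines of an open text file."""
    return [line.strip() for line in handle if line.strip()]


# Stand-in for open('imdb_ids.txt'): two IDs followed by blank lines.
sample_file = io.StringIO('tt0111161\ntt1856010\n\n')
ids = read_ids(sample_file)
print(ids)
```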

In [389]:
# Write a for loop that builds a list in which each element is a dictionary of movie information 
# from time import sleep

infoList = []
for movie_id in imdb_ids:
    infoList.append(get_movie_info(movie_id))
    sleep(3)

infoList
# get_movie_info(imdb_ids[0])


[{'content_rating': u'R',
  'description': u'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.',
  'duration': '142',
  'star_rating': u'9.3',
  'title': u'The Shawshank Redemption'},
 {'content_rating': u'TV-MA',
  'description': u'A Congressman works with his equally conniving wife to exact revenge on the people who betrayed him.',
  'duration': '51',
  'star_rating': u'9.0',
  'title': u'House of Cards'},
 {'content_rating': u'TV-PG',
  'description': u'A TV show centered on six students and their years at Bayside High School in Palisades, California.',
  'duration': '30',
  'star_rating': u'7.0',
  'title': u'Saved by the Bell'},
 {'content_rating': u'PG',
  'description': u'A young man is accidentally sent thirty years into the past in a time-traveling DeLorean invented by his friend, Dr. Emmett Brown, and must make sure his high-school-age parents unite in order to save his own existence.',
  'duration': '116',


In [390]:
# Convert list into a DataFrame (pd.DataFrame(infoList) would also work directly on a list of dicts)

import numpy as np
from pandas import DataFrame, Series

# infoList2[0].keys()

tempDict = {}
for item in range(len(infoList)):
    tempDict[str(item)] = infoList[item]

tempDict


{'0': {'content_rating': u'R',
  'description': u'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.',
  'duration': '142',
  'star_rating': u'9.3',
  'title': u'The Shawshank Redemption'},
 '1': {'content_rating': u'TV-MA',
  'description': u'A Congressman works with his equally conniving wife to exact revenge on the people who betrayed him.',
  'duration': '51',
  'star_rating': u'9.0',
  'title': u'House of Cards'},
 '2': {'content_rating': u'TV-PG',
  'description': u'A TV show centered on six students and their years at Bayside High School in Palisades, California.',
  'duration': '30',
  'star_rating': u'7.0',
  'title': u'Saved by the Bell'},
 '3': {'content_rating': u'PG',
  'description': u'A young man is accidentally sent thirty years into the past in a time-traveling DeLorean invented by his friend, Dr. Emmett Brown, and must make sure his high-school-age parents unite in order to save his own existence.',
 

In [391]:
movieInfo = DataFrame.from_dict(tempDict, orient='index')

In [392]:
movieInfo

Unnamed: 0,duration,star_rating,description,content_rating,title
0,142,9.3,Two imprisoned men bond over a number of years...,R,The Shawshank Redemption
1,51,9.0,A Congressman works with his equally conniving...,TV-MA,House of Cards
2,30,7.0,A TV show centered on six students and their y...,TV-PG,Saved by the Bell
3,116,8.5,A young man is accidentally sent thirty years ...,PG,Back to the Future
4,148,8.8,A thief who steals corporate secrets through u...,PG-13,Inception
5,Na,Na,Na,No Rating,404 Error


## Bonus -- Challenge

Scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year and unique movie identifier (e.g. 'tt0111161').
Use the function above to scrape each of the movie pages.
Questions:
How many of the Top movies are rated 'R'?
What is the average duration of movies with a star_rating above 9?
What is the average duration of movies released before 1985, and of those released after?

In [394]:
# See above code