# Web Scraping with BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
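
A robots.txt check can also be done in code. The sketch below uses the standard library's robots.txt parser on a made-up rule set (the two rules and the example.com URLs are assumptions for illustration, not IMDb's real file), so it runs without touching the network.

```python
try:
    from urllib import robotparser  # Python 3
except ImportError:
    import robotparser  # Python 2

# Hypothetical robots.txt content, NOT copied from any real site
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = robotparser.RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) says whether the rules permit the request
allowed = parser.can_fetch('*', 'http://www.example.com/title/tt0111161/')
blocked = parser.can_fetch('*', 'http://www.example.com/private/page.html')

print(allowed)
print(blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead of parsing a hand-written list.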

What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
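
To make "opened and closed" and "attributes" concrete, here is a small sketch using the standard library's HTML parser (not BeautifulSoup) to log each piece of a one-line snippet. The `TagLogger` class is a hypothetical helper written for this illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag and piece of text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('open', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('close', tag, {}))

    def handle_data(self, data):
        if data.strip():
            self.events.append(('text', data.strip(), {}))


logger = TagLogger()
logger.feed('<h1 id="main">DAT10 Class 6</h1>')
logger.close()

for event in logger.events:
    print(event)
```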

How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's inspect example.html

### Acquire Your Data

In [1]:
# read the HTML code for a web page and save as a string
with open('../../DAT-DC-10/data/example.html', 'rU') as f:
    html = f.read()

### Look at / Explore Your Data (HTML)

In [2]:
#print html

In [3]:
# convert HTML into a structured Soup object
from bs4 import BeautifulSoup
b = BeautifulSoup(html, 'html.parser')

In [4]:
# print out the object
#print b
#print b.prettify()

#### 'find' method returns the first matching Tag (and everything inside of it)

In [5]:
b.find(name='body')
b.find(name='h1')

<h1 id="main">DAT10 Class 6</h1>

In [6]:
# Tags allow you to access the 'inside text'
b.find(name='h1').text

u'DAT10 Class 6'

In [7]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']

u'main'
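
One thing to watch for: `find` returns `None` when nothing matches, and indexing a Tag with a missing attribute raises a `KeyError`. The sketch below shows one way to guard against both. The HTML string is a shortened stand-in for example.html (an assumption, since the file is not reproduced here), and `text_or_default` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for example.html (assumed, not the real file)
html = ('<h1 id="main">DAT10 Class 6</h1>'
        '<p class="topic" id="api">First, we are covering APIs.</p>')
soup = BeautifulSoup(html, 'html.parser')

# find returns None when there is no matching tag
missing = soup.find(name='h3')

# .get behaves like dict.get: no KeyError for a missing attribute
heading = soup.find(name='h1')
heading_class = heading.get('class')
heading_id = heading.get('id', 'no-id')


def text_or_default(tag, default=''):
    """Return the tag's text, or a default when the tag was not found."""
    return tag.text.strip() if tag is not None else default


print(text_or_default(heading))
print(text_or_default(missing, default='(not found)'))
```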

#### 'find_all' method is useful for finding all matching Tags

In [8]:
test = b.find_all(name='p')    # returns a ResultSet (like a list of Tags)

### Quiz: What is the datatype returned by 'find_all'? What kinds of operations can we do on that datatype?

In [9]:
type(test)

bs4.element.ResultSet

In [10]:
tag1 = test[0]

In [11]:
tag1.contents

[u'First, we are covering APIs, which are useful for getting data.']

Hint: ResultSets can be sliced

In [12]:
#len(b.find_all(name='p'))
#b.find_all(name='p')[0]
#b.find_all(name='p')[0].text
b.find_all(name='p')[0]['id']

u'api'

In [13]:
# iterate over a ResultSet
results = b.find_all(name='p')
for tag in results:
    print tag.text

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Here are some helpful API resources:
Here are some helpful web scraping resources:


In [14]:
print "\n".join(tag.text for tag in results if len(tag.text) > 0)

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Here are some helpful API resources:
Here are some helpful web scraping resources:


In [15]:
# build the list first, then join it
list_text = [tag.text for tag in results if len(tag.text) > 0]
print "\n".join(list_text)

### Quiz: How would you write the above as a list comprehension?

### Limit search by Tag attribute

In [16]:
b.find(name='p', attrs={'id':'scraping'})

<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>

In [17]:
#b.find_all(name='p', attrs={'class':'topic'})
b.find_all(attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>]

### Limit search to specific sections

In [18]:
#b.find_all(name='li')
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

[<li>Web scraping resource 1</li>, <li>Web scraping resource 2</li>]

## In Class Exercise

1) Find the 'h2' tag and then print its text

In [19]:
b.find(name='h2').text

u'Resource List'

In [20]:
# the same idea works when there are several 'h2' tags
tagList = b.find_all(name='h2')
print ", ".join(tag.text for tag in tagList)

2) Find the 'p' tag with an 'id' value of 'feedback' and then print its text


In [21]:
b.find(name='p', attrs={'id':'feedback'}).text

u'Finally, I will ask you to fill out yet another feedback form!'

3) Find the first 'p' tag and then print the value of the 'id' attribute


In [22]:
b.find(name='p')['id']

u'api'

4) Print the text of all four resources

In [23]:
", ".join(tag.text for tag in b.find_all(name='li'))

u'API resource 1, API resource 2, Web scraping resource 1, Web scraping resource 2'

5) Using a list comprehension, can you extract the text of only the API resources?

In [24]:
print "\n".join(tag.text for tag in b.find(name='ul', attrs={'id':'api'}).find_all(name='li'))

API resource 1
API resource 2


### Tool: Selector Gadget
http://selectorgadget.com/
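
Selector Gadget produces CSS selectors, and BeautifulSoup can use those directly through `select` and `select_one`. A minimal sketch, again on a shortened stand-in for example.html (the HTML string is an assumption pieced together from the outputs above):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for example.html (assumed, not the real file)
html = """
<h1 id="main">DAT10 Class 6</h1>
<p class="topic" id="api">First, we are covering APIs.</p>
<p class="topic" id="scraping">Then, we are covering web scraping.</p>
<ul id="api"><li>API resource 1</li><li>API resource 2</li></ul>
<ul id="scraping"><li>Web scraping resource 1</li><li>Web scraping resource 2</li></ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'p.topic' means: every <p> tag with class "topic"
topics = soup.select('p.topic')

# 'ul#scraping li' means: every <li> inside the <ul> with id "scraping"
scraping_items = soup.select('ul#scraping li')

# select_one returns only the first match (or None)
heading = soup.select_one('h1#main')

print([li.text for li in scraping_items])
```

The second selector does the same job as `b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')`.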

## Scraping IMDb

#### First open your browser and look at the website and the HTML structure

http://www.imdb.com/title/tt0111161/

#### Get the HTML from the Shawshank Redemption page

In [25]:
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')

#### What is r? What can we do with it?

In [26]:
type(r)

requests.models.Response
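
A Response carries more than the page text: `status_code`, `headers` and `raise_for_status()` are useful for spotting failed requests before parsing. Below is a sketch of a small wrapper; `fetch_html` and its 10-second timeout are assumptions for illustration, not part of the class code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx / 5xx status codes
    response.raise_for_status()
    return response.text
```

With this in place, `html = fetch_html('http://www.imdb.com/title/tt0111161/')` would either return the HTML or fail loudly instead of handing an error page to BeautifulSoup.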

In [27]:
# in the notebook, type "r." and press Tab to list what the Response offers
# (for example r.status_code, r.headers, r.text)

#### convert HTML into Soup

In [28]:
b = BeautifulSoup(r.text, 'html.parser')
print b


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8">
<meta content="IE=edge" http-equiv="X-UA-Compatible">
<script type="text/javascript">var ue_t0=window.ue_t0||+new Date();</script>
<script type="text/javascript">
                var ue_mid = "A1EVAM02EL8SFB"; 
                var ue_sn = "www.imdb.com";  
                var ue_furl = "fls-na.amazon.com";
                var ue_sid = "000-0000000-0000000";
                var ue_id = "1C6ED3N1MEXT51PTG99B";
                (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b.skipTrace=1;e.onerror=b;function f(){c.uex("ld")}if(e.addEventListener){e.addEventListener("load",f,false)}else{

In [29]:
# run this code if you have encoding errors
import sys
reload(sys)
sys.setdefaultencoding('utf8')

#### Get the title

In [30]:
b.find('h1').text

u'The Shawshank Redemption\n                   (1994)\n                   \n'
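
The raw title text mixes the title, the year and a lot of whitespace. One possible way to separate them is a regular expression. In this sketch, `split_title_year` is a hypothetical helper, and the pattern assumes the year appears as four digits in parentheses at the end, as in the output above.

```python
import re


def split_title_year(raw):
    """Split 'Title\\n   (1994)\\n' into ('Title', 1994)."""
    match = re.match(r'\s*(.*?)\s*\((\d{4})\)\s*$', raw, re.DOTALL)
    if match is None:
        # no year found: return the cleaned text and no year
        return raw.strip(), None
    return match.group(1), int(match.group(2))


# the string returned by b.find('h1').text above
raw_title = 'The Shawshank Redemption\n                   (1994)\n                   \n'
title, year = split_title_year(raw_title)
print(title)
print(year)
```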

#### Get the Star Rating (as a float)

In [31]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)

9.3

In [32]:
b.find_all(name='span', attrs={'class':'rating-rating'})

[<span class="rating-rating "><span class="value">9</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">9.2</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">8.8</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">8.9</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">8.9</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">8.8</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">9</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class="rating-rating "><span class="value">8.9</span><span class="grey">/</span><span class="grey">10</span></span>,
 <span class

#### Get the Content Rating

In [33]:
# .text returns too much here: the rating comes first, followed by duration, genres and release date
panel = b.find('meta', attrs={'itemprop':'contentRating'})
panel.text

u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '
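
The text above looks like four fields separated by `|`. Assuming that layout holds (it is inferred from this one output, so other pages may differ), the fields can be pulled apart with plain string methods:

```python
# the string returned by panel.text above
raw = u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '

# split on the pipe, then collapse runs of whitespace inside each piece
fields = [' '.join(part.split()) for part in raw.split('|')]
content_rating, duration, genres, release = fields

print(content_rating)
print(duration)
print(genres)
print(release)
```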

### Optional Web Scraping Homework

First, define a function that accepts an IMDb ID and returns a dictionary of
movie information: title, star_rating, description, content_rating, duration.
The function should gather this information by scraping the IMDb website, not
by calling the OMDb API. (This is really just a wrapper of the web scraping
code we wrote above.)

For example, `get_movie_info('tt0111161')` should return:
```
{'content_rating': 'R',
 'description': u'Two imprisoned men bond over a number of years...',
 'duration': 142,
 'star_rating': 9.3,
 'title': u'The Shawshank Redemption'}
```

Then, open the file imdb_ids.txt using Python, and write a for loop that builds
a list in which each element is a dictionary of movie information.
Finally, convert that list into a DataFrame.

### Bonus -- Challenge

Can you scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year and the unique movie identifier (e.g. 'tt0111161')?

Use the function above to scrape each of the movie pages.

**Questions:**

How many of the Top movies are rated 'R'?

What is the average duration of movies with a star_rating above 9?

What is the average duration of movies before 1985 and after?



In [16]:
import requests
from bs4 import BeautifulSoup

def get_movie_info(uniq_id):
    # download and parse the movie page for the given IMDb ID
    r = requests.get('http://www.imdb.com/title/' + uniq_id + '/')
    b = BeautifulSoup(r.text, 'html.parser')
    # the h1 text is 'Title\n   (Year)\n...': first line is the title, second the year
    title_year = b.find('h1').text.split('\n')
    title_name = title_year[0]
    year_name = title_year[1]    # parsed but not returned
    descrip = b.find('div', {'class':'summary_text'}).text.split('\n')[1].strip()
    subtext = b.find('div', {'class': 'subtext'})
    # the meta tag's text runs on past the rating, so keep only the first line
    rating = subtext.find('meta', {'itemprop':'contentRating'}).text.split('\n')[0]
    duration = subtext.find('time', {'itemprop':'duration'}).text.strip()
    ratingWrapper = b.find('div', {'class':'imdbRating'})
    star_rating = ratingWrapper.find('span', {'itemprop':'ratingValue'}).text
    return {'title':title_name, 'description':descrip,
            'content_rating':rating, 'duration':duration,
            'star_rating':star_rating}

In [66]:
get_movie_info('tt0068646')

{'content_rating': u'R',
 'description': u'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.',
 'duration': u'2h 55min',
 'star_rating': u'9.2',
 'title': u'The Godfather'}
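
The homework example expects `duration` as a number of minutes and `star_rating` as a float, while the function returns both as strings. One possible conversion is sketched below. `duration_to_minutes` is a hypothetical helper, and it assumes durations always look like '2h 55min', '51min' or '3h'.

```python
import re


def duration_to_minutes(text):
    """Convert '2h 55min' (or '51min', or '3h') to a number of minutes."""
    match = re.match(r'^\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*$', text)
    if match is None or not (match.group(1) or match.group(2)):
        return None  # unexpected format
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


# values taken from the outputs in this notebook
print(duration_to_minutes('2h 55min'))
print(duration_to_minutes('51min'))
print(float('9.2'))
```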

In [75]:
import pandas as pd
with open('../../DAT-DC-10/data/imdb_ids.txt', 'r') as f:
    all_movie_info = [get_movie_info(mov_id.strip()) for mov_id in f]

In [78]:
pd.DataFrame(all_movie_info)

| | content_rating | description | duration | star_rating | title |
|---|---|---|---|---|---|
| 0 | R | Two imprisoned men bond over a number of years... | 2h 22min | 9.3 | The Shawshank Redemption |
| 1 | TV-MA | A Congressman works with his equally conniving... | 51min | 9.0 | House of Cards |
| 2 | TV-PG | A TV show centered on six students and their y... | 30min | 7.0 | Saved by the Bell |
| 3 | PG | A young man is accidentally sent thirty years ... | 1h 56min | 8.5 | Back to the Future |
| 4 | PG-13 | A thief who steals corporate secrets through u... | 2h 28min | 8.8 | Inception |


In [3]:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.imdb.com/chart/top?ref_=nv_mv_250_6')
b = BeautifulSoup(r.text, 'html.parser')

In [10]:
movie_title_columns = b.find('div', {'class':'lister'}).find('tbody',{'class':'lister-list'}).find_all('td',{'class':'titleColumn'})
len(movie_title_columns)

250

In [13]:
movie_links = [title_col.find('a')['href'].split('/')[2] for title_col in movie_title_columns]
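
`split('/')[2]` depends on the ID always being the third piece of the link. A regular expression is a slightly more forgiving alternative. In this sketch, `extract_imdb_id` is a hypothetical helper and the example href is made up to resemble a Top 250 link.

```python
import re


def extract_imdb_id(href):
    """Return the first 'tt' + digits run in the link, or None."""
    match = re.search(r'tt\d+', href)
    return match.group(0) if match else None


# hypothetical href, shaped like the links in the Top 250 table
example_href = '/title/tt0111161/?ref_=chttp_tt_1'
print(extract_imdb_id(example_href))
print(example_href.split('/')[2])
```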

In [17]:
imdb_250_info = [get_movie_info(mov_link.strip()) for mov_link in movie_links]
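
Requesting 250 pages back to back is the "too many pages too fast" case from the best practices, and a single page with an unexpected layout would stop the whole list comprehension. The sketch below pauses between requests and collects failures instead of crashing. `scrape_all` and its one-second default delay are assumptions, not part of the class code.

```python
import time


def scrape_all(ids, scrape_one, delay=1.0):
    """Call scrape_one(id) for each id, pausing between calls."""
    results, failures = [], []
    for position, movie_id in enumerate(ids):
        if position > 0:
            time.sleep(delay)  # be polite: wait before the next request
        try:
            results.append(scrape_one(movie_id))
        except Exception as error:
            # broad on purpose for a sketch: keep going and note what failed
            failures.append((movie_id, str(error)))
    return results, failures
```

Usage would look like `imdb_250_info, failed = scrape_all(movie_links, get_movie_info)`, after which `failed` lists the IDs worth retrying.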

In [19]:
import pandas as pd
imdb_df = pd.DataFrame(imdb_250_info)

In [29]:
imdb_df

| | content_rating | description | duration | star_rating | title |
|---|---|---|---|---|---|
| 0 | R | Two imprisoned men bond over a number of years... | 2h 22min | 9.3 | The Shawshank Redemption |
| 1 | R | The aging patriarch of an organized crime dyna... | 2h 55min | 9.2 | The Godfather |
| 2 | R | The early life and career of Vito Corleone in ... | 3h 22min | 9.0 | The Godfather: Part II |
| 3 | PG-13 | When the menace known as the Joker wreaks havo... | 2h 32min | 9.0 | The Dark Knight |
| 4 | Not Rated | A dissenting juror in a murder trial slowly ma... | 1h 36min | 8.9 | 12 Angry Men |
| 5 | R | In Poland during World War II, Oskar Schindler... | 3h 15min | 8.9 | Schindler's List |
| 6 | R | The lives of two mob hit men, a boxer, a gangs... | 2h 34min | 8.9 | Pulp Fiction |
| 7 | Not Rated | A bounty hunting scam joins two men in an unea... | 2h 41min | 8.9 | The Good, the Bad and the Ugly |
| 8 | PG-13 | Gandalf and Aragorn lead the World of Men agai... | 3h 21min | 8.9 | The Lord of the Rings: The Return of the King |
| 9 | R | An insomniac office worker, looking for a way ... | 2h 19min | 8.9 | Fight Club |
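
To answer the questions, the string columns first need to become numbers. A sketch with pandas follows. The small `sample` frame copies only the first five rows shown above, so the printed numbers describe those five rows and not the full Top 250; the `minutes` column name is an assumption.

```python
import pandas as pd

# first five rows of the table above (only the columns needed here)
sample = pd.DataFrame({
    'content_rating': ['R', 'R', 'R', 'PG-13', 'Not Rated'],
    'duration': ['2h 22min', '2h 55min', '3h 22min', '2h 32min', '1h 36min'],
    'star_rating': ['9.3', '9.2', '9.0', '9.0', '8.9'],
})

# strings -> numbers
sample['star_rating'] = sample['star_rating'].astype(float)
parts = sample['duration'].str.extract(r'(?:(\d+)h)?\s*(?:(\d+)min)?')
hours = parts[0].fillna('0').astype(int)
mins = parts[1].fillna('0').astype(int)
sample['minutes'] = hours * 60 + mins

# how many rows are rated 'R'?
r_count = (sample['content_rating'] == 'R').sum()

# average duration of rows with a star_rating above 9
mean_minutes = sample.loc[sample['star_rating'] > 9, 'minutes'].mean()

print(r_count)
print(mean_minutes)
```

The same lines should work on `imdb_df` in place of `sample`. The before/after 1985 question would also need a year column, which `get_movie_info` does not return as written.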
