## Web Scraping 1: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

For Project Luther, we will be scraping information about movies from the internet. 

### First, an HTML refresher

In [2]:
# HTML is the basic language used to create a web page.
# It tells the web browser what text/media to display, where to display it, and how to display it (style).
# HTML is very structured/hierarchical.
# Every page is made up of discrete "elements".
# Elements are labeled with "tags".

# For example:
#      <p>You are beginning to learn HTML.</p>

# A start tag also often contains "attributes" with info about the element.
# Attributes usually have a name and value
# Example:
#       <p class="my_red_sentences">You are beginning to learn HTML.</p>


# <html> 
#   <head> </head>
#   <body>
#      <p class="red">You are beginning to learn HTML.</p>
#      <h1> This is a header </h1>
#      <a href="http://www.google.com"> Some link </a>
#   </body>
# </html>
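
To see the element/tag/attribute structure without a browser, here is a minimal sketch that walks the sample page with Python's built-in `html.parser` module. It is written for Python 3 (the notebook itself runs Python 2), and the `TagLister` class is a hypothetical helper made up for this illustration, not part of BeautifulSoup.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Hypothetical helper: record every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


html = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
"""

lister = TagLister()
lister.feed(html)

# Print each tag name next to its attribute dictionary
for tag, attributes in lister.tags:
    print(tag, attributes)
```

Only the `<p>` and `<a>` elements carry attributes here; every other tag is listed with an empty dictionary.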

### Get the HTML from a page and convert to a BeautifulSoup object

We'll start by scraping some information from [this page](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [6]:
import urllib2
import re
#!pip install beautifulsoup4
## students might have to: sudo pip install beautifulsoup4
from bs4 import BeautifulSoup

url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

You are using pip version 6.1.0, however version 6.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.3.2.tar.gz (143kB)
Installing collected packages: beautifulsoup4
  Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4-4.3.2


In [9]:
#print soup
print soup.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type">
   <title>
    The Big Lebowski (1998) - Box Office Mojo
   </title>
   <style type="text/css">
    table.chart-wide { width: 100%; }
   </style>
   <meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo" name="keywords">
    <meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description

## soup.find()
soup.find() is the most common function we will use from this package.  
Let's try out some common variations of soup.find() 

In [11]:
# soup.find() returns the first matching tag it finds. It searches the entire tree.
# Search for a type of tag by passing the tag name as a string (like 'body', 'div', 'p', 'a').
print soup.find('a')

#Equivalently:
#print soup.a
print soup.a.prettify()

<a href="/goto.php?a=5" target="4"><font face="Verdana" size="3"><b>'Furious 7' hits $800 million worldwide... &gt;</b></font><br/></a>
<a href="/goto.php?a=5" target="4">
 <font face="Verdana" size="3">
  <b>
   'Furious 7' hits $800 million worldwide... &gt;
  </b>
 </font>
 <br/>
</a>



In [12]:
# soup.find_all() returns a list of all matches
#for link in soup.find_all('a'): 
#    print link

# retrieve the url from an anchor tag 
soup.find('a')['href']

u'/goto.php?a=5'
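
The `href` above is relative to the site root, so it cannot be fetched on its own. A small sketch of one way to resolve it against the page URL, using the standard library's `urljoin` (Python 3 import shown; in Python 2 the same function lives in the `urlparse` module). The variable names are made up for the illustration.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

page_url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'
relative_href = '/goto.php?a=5'

# urljoin resolves the href against the page it was found on
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)  # http://boxofficemojo.com/goto.php?a=5
```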

In [9]:
# you can match on an attribute like an id or class. 
# With your web browser (like Chrome), you can show what 
# the 'mp_box_content' classes look like on the webpage with Inspect Element

#for element in soup.find_all(class_='mp_box_content'):
#    print element, '\n'

# Find all the cells (<td>) in the first mp_box_content element by chaining find and find_all
print soup.find(class_='mp_box_content').find_all('td')

# find with an id. ID is unique.
# print soup.find(id='hp_footer')

[<td width="40%"><b>Domestic:</b></td>, <td align="right" width="35%"> <b>$17,451,873</b></td>]


In [10]:
### Do we need to take a break?

## Consistency
Web scraping is much simpler when similar pages of a website present their information in a consistent format.

### Items to scrape for each movie:
* movie title
* total domestic gross
* release date
* runtime
* rating


In [11]:
# Movie Title
print soup.find('title')
title_string = soup.find('title').text
print title_string
print title_string.split('(')
title = title_string.split('(')[0].strip()
print title

<title>The Big Lebowski (1998) - Box Office Mojo</title>
The Big Lebowski (1998) - Box Office Mojo
[u'The Big Lebowski ', u'1998) - Box Office Mojo']
The Big Lebowski


In [13]:
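# Domestic Total Gross
# A plain string passed to text= must match a text node exactly,
# including the trailing colon and space.
print soup.find(text="Domestic Total Gross: ")
# Without the trailing ": " nothing matches exactly, so find() returns None.
print soup.find(text="Domestic Total Gross")

# Compiling the string into a regex turns the exact match into a partial match.
import re
dtg_string = soup.find(text=re.compile('Domestic Total'))
print dtg_string
# The dollar amount lives in the tag right after the label.
print dtg_string.find_next_sibling()
dtg = dtg_string.find_next_sibling().text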

dtg = dtg.replace('$','').replace(',','')
domestic_total_gross = int(dtg)

print domestic_total_gross

Domestic Total Gross: 
None
Domestic Total Gross: 
<b>$17,451,873</b>
17451873
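
The difference between the exact match and the regex match can be reproduced without BeautifulSoup. The sketch below is only a model of that behavior on plain strings: `find_exact` and `find_pattern` are hypothetical stand-ins, and it assumes the regex is applied with `search`, so it may match anywhere in the text. It is written for Python 3.

```python
import re

# Stand-ins for the text nodes on the page
text_nodes = ['Domestic Total Gross: ', 'Runtime: ', 'MPAA Rating: ']


def find_exact(nodes, wanted):
    """Return the first node equal to `wanted`, else None."""
    for node in nodes:
        if node == wanted:
            return node
    return None


def find_pattern(nodes, pattern):
    """Return the first node in which `pattern` matches anywhere, else None."""
    for node in nodes:
        if pattern.search(node):
            return node
    return None


print(repr(find_exact(text_nodes, 'Domestic Total Gross: ')))            # exact string, found
print(repr(find_exact(text_nodes, 'Domestic Total Gross')))              # missing ': ', so None
print(repr(find_pattern(text_nodes, re.compile('Domestic Total'))))      # partial match, found
```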


### We can actually do several of these using the text matching method, so let's make a function for that

In [13]:
def get_movie_value(soup, field_name):
    '''
    Find the first text node on the page that matches field_name (used as a regex)
    and return the text of its next sibling tag (the value for that field).
    Returns None if the field or its sibling is not found.
    '''
    obj = soup.find(text=re.compile(field_name))
    if not obj:
        return None

    # this works for most of the values
    next_sibling = obj.find_next_sibling()
    if next_sibling:
        return next_sibling.text
    return None

In [14]:
#domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print dtg

#runtime
runtime = get_movie_value(soup,'Runtime')
print runtime

#rating
rating = get_movie_value(soup,'MPAA Rating')
print rating 

$17,451,873
1 hrs. 57 min.
R


In [15]:
## Break time?

### We need a few helper functions to parse the strings we've gotten

In [16]:
import dateutil.parser

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date

def money_to_int(moneystring):
    moneystring = moneystring.replace('$','').replace(',','')
    return int(moneystring)

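def runtime_to_minutes(runtimestring):
    # e.g. '1 hrs. 57 min.' splits into ['1', 'hrs.', '57', 'min.']
    runtime = runtimestring.split() #default is whitespace
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):
        # unexpected format: too few parts or non-numeric values
        return None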
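
A quick way to check helpers like these is to run them on the raw strings seen earlier, plus one malformed value. The sketch below is for Python 3 and uses only the standard library: `money_to_int` and `runtime_to_minutes` are compact copies of the helpers above, while `to_date_strict` is a hypothetical alternative to the `dateutil` version that assumes dates always look like `March 6, 1998` and that month names are in English.

```python
from datetime import datetime


def money_to_int(moneystring):
    # Compact copy of the notebook helper
    return int(moneystring.replace('$', '').replace(',', ''))


def runtime_to_minutes(runtimestring):
    # Compact copy of the notebook helper, catching only the expected errors
    parts = runtimestring.split()
    try:
        return int(parts[0]) * 60 + int(parts[2])
    except (IndexError, ValueError):
        return None


def to_date_strict(datestring):
    # Hypothetical stdlib-only variant: accepts exactly 'Month D, YYYY'
    return datetime.strptime(datestring, '%B %d, %Y')


print(money_to_int('$17,451,873'))           # 17451873
print(runtime_to_minutes('1 hrs. 57 min.'))  # 117
print(runtime_to_minutes('N/A'))             # None
print(to_date_strict('March 6, 1998'))       # 1998-03-06 00:00:00
```

The fixed format string makes `to_date_strict` fail loudly on unexpected input, whereas `dateutil.parser.parse` guesses the format; which is preferable depends on how uniform the scraped pages turn out to be.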

In [17]:
#let's get these again and format them all in one swoop
from pprint import pprint

raw_release_date = get_movie_value(soup,'Release Date')
print raw_release_date
release_date = to_date(raw_release_date)

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

headers = ['movie title','domestic total gross','release date','runtime (mins)','rating']

movie_data = []
movie_dict = dict(zip(headers,[title,
                               domestic_total_gross,
                               release_date,
                               runtime,
                               rating]))
#print movie_dict
movie_data.append(movie_dict)

pprint(movie_data)

March 6, 1998
[{'domestic total gross': 17451873,
  'movie title': u'The Big Lebowski',
  'rating': u'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0),
  'runtime (mins)': 117}]


In [18]:
## Optional extra: how the zip function works
a = ['a','b','c']
b = [1,2,3]
print zip(a,b)
dict(zip(a,b))

[('a', 1), ('b', 2), ('c', 3)]


{'a': 1, 'b': 2, 'c': 3}
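
The output above is from Python 2, where `zip` returns a list. As a side note for anyone running this under Python 3, the sketch below shows the differences worth knowing: `zip` returns a one-shot iterator there, and in both versions it stops at the shortest input.

```python
a = ['a', 'b', 'c']
b = [1, 2, 3]

pairs = zip(a, b)            # an iterator in Python 3, not a list
first_pass = list(pairs)
second_pass = list(pairs)    # the iterator is already used up

print(first_pass)            # [('a', 1), ('b', 2), ('c', 3)]
print(second_pass)           # []

# dict() consumes the pairs directly, exactly as in the cell above
mapping = dict(zip(a, b))
print(mapping)

# zip stops as soon as the shortest input runs out
shorter = list(zip(a, [1, 2]))
print(shorter)               # [('a', 1), ('b', 2)]
```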

In [89]:
## Regexes (Regular Expressions)
## We strongly encourage you to read up on and practice regular expressions. See Dive Into Python, Chapter 7.
## http://www.diveintopython.net/regular_expressions/index.html
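
As a first regex exercise, the page title scraped earlier can be split into a title and a release year in one step. This is only a sketch for Python 3: the pattern and the names `title_pattern`, `title`, and `year` are made up for the illustration, and it assumes every title string has the form `Name (YYYY) - Box Office Mojo`.

```python
import re

title_string = 'The Big Lebowski (1998) - Box Office Mojo'

# (?P<title>.+?)  lazily captures everything before the parenthesised year
# \s*\(           skips the space and the opening parenthesis
# (?P<year>\d{4}) captures exactly four digits
title_pattern = re.compile(r'^(?P<title>.+?)\s*\((?P<year>\d{4})\)')

match = title_pattern.match(title_string)
if match:
    title = match.group('title')
    year = int(match.group('year'))
    print(title)  # The Big Lebowski
    print(year)   # 1998
```

Compared with `split('(')`, the pattern also yields the year and returns no match, rather than a wrong answer, when the string has an unexpected shape.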