# Web Scraping 101: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Scraping data from the internet.

## HTML refresher

* HTML is the basic language used to create a web page. 
* It tells the web browser what text/media to display, where to display it, and how to display it (style)
* HTML is very structured and hierarchical.
* Every page is made up of discrete "elements."

## HTML refresher

* Elements are labeled with "tags."

* For example:

    ```html
    <p>You are beginning to learn HTML.</p>
    ```

## HTML refresher

* A start tag also often contains "attributes" with info about the element.

* Attributes usually have a name and value.

* Example:

```html
<p class="my_red_sentences">You are beginning to learn HTML.</p>
```

## HTML refresher

A full HTML document has a structure more like this:

```html
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
```
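To make the element/tag/attribute vocabulary concrete, here is a minimal sketch that walks a small document with Python's standard-library `html.parser` and lists each start tag with its attributes. The `TagLister` class and `SAMPLE_HTML` string are illustrative names, not part of the lesson's code, and BeautifulSoup (introduced below) is the more convenient tool for real scraping.

```python
from html.parser import HTMLParser

# Hypothetical sample document, modeled on the structure shown above
SAMPLE_HTML = """
<html>
  <head></head>
  <body>
    <p class="red">You are beginning to learn HTML.</p>
    <h1>This is a header</h1>
    <a href="https://www.google.com">Some link</a>
  </body>
</html>
"""

class TagLister(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

lister = TagLister()
lister.feed(SAMPLE_HTML)
for tag, attrs in lister.tags:
    print(tag, attrs)
```

Each printed line is one element: the tag name followed by a dictionary of its attributes (empty when the start tag has none).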

## HTML refresher

* Let's explore some live HTML!
* Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser
* Right-click the page and choose Inspect Element, then also try View Page Source.

## HTML to BeautifulSoup

Scrape some information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

First we need to send an HTTP request to the website in order to access the webpage information (HTML), then we can use BeautifulSoup to parse the raw HTML. 

## HTTP
Hypertext Transfer Protocol (HTTP) is a communications protocol. It is used to send and receive webpages and files on the internet.

A web browser (or web crawler) is an HTTP client because it sends requests to an HTTP server (web server). The web server then sends responses back to the client. The standard port for HTTP servers to listen on is 80.

GET is the most common HTTP method. Essentially, it is a method used by the client that says "give me this resource". We will be pulling down webpages via the GET method today. Let's get started.

<img src='img/https.png'>
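As a rough illustration of what a GET request looks like on the wire, the sketch below assembles the request text for a URL without sending anything. The helper name `build_get_request` is hypothetical, and a real client sends additional headers (for example `User-Agent` and `Accept`), so treat this as a simplified picture rather than exactly what any particular library transmits.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Return (host, port, request_text) for a plain HTTP GET of `url`."""
    parts = urlsplit(url)
    port = parts.port or 80            # 80 is the default port for http://
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    request_text = (
        "GET {} HTTP/1.1\r\n"
        "Host: {}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).format(target, parts.hostname)
    return parts.hostname, port, request_text

host, port, text = build_get_request(
    "http://boxofficemojo.com/movies/?id=biglebowski.htm")
print(host, port)
print(text)
```

The first line of the request names the method (`GET`) and the resource being asked for; the `Host` header tells the server which site the request is meant for.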

In [1]:
from __future__ import print_function, division

In [2]:
# if needed: conda install requests
import requests

url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

## HTML to BeautifulSoup

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [3]:
response.status_code

200
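For a quick way to read a status code without leaving Python, the standard library's `http.HTTPStatus` maps codes to their standard reason phrases. The helper `describe_status` below is a hypothetical convenience function, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for a standard HTTP status code."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return "{} (non-standard status code)".format(code)

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses, so a failed download does not silently flow into the parsing step.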

## HTML to BeautifulSoup

In [4]:
print(response.text)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">
<title>The Big Lebowski (1998) - Box Office Mojo</title>

<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<META name="keywords" content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo">
<META name="description" content="The Big Lebowski summary of box office results, charts and release information and related links.">

<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="scre

## HTML to BeautifulSoup

In [5]:
page = response.text

In [6]:
# if needed: conda install beautifulsoup4 lxml

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")

In [7]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
<title>The Big Lebowski (1998) - Box Office Mojo</title>
<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo" name="keywords"/>
<meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
<link charset="utf-8" href="/css/mojo.css?1" media="screen" rel="stylesheet" 

## HTML to BeautifulSoup

In [8]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
  <title>
   The Big Lebowski (1998) - Box Office Mojo
  </title>
  <style type="text/css">
   table.chart-wide { width: 100%; }
  </style>
  <meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo" name="keywords"/>
  <meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
  <link charset="utf-8" href="/css/mojo.css?1" media="

## `soup.find()`

* `soup.find()` is the most common function we will use from this package.  
* Let's try out some common variations of `soup.find()`

* `soup.find()` returns the first matched tag it finds.
* It searches the entire tree.

* Search for a type of tag by using the tag as a string argument ('body','div','p','a')

## `soup.find()`

In [9]:
print(soup.find('a'))

<a href="/daily/chart/">Daily Box Office (Sun.)</a>


In [10]:
# Equivalently:
print(soup.a)

<a href="/daily/chart/">Daily Box Office (Sun.)</a>


In [11]:
# Prettier:
print(soup.a.prettify())

<a href="/daily/chart/">
 Daily Box Office (Sun.)
</a>



## `soup.find_all()`

`soup.find_all()` returns a list of all matches.

In [12]:
len(soup.find_all('a'))

96

In [13]:
#find and print all hyperlink elements
for link in soup.find_all('a'): 
    print(link)

<a href="/daily/chart/">Daily Box Office (Sun.)</a>
<a href="/weekend/chart/">Weekend Box Office (Jul. 6–8)</a>
<a href="/movies/?id=ant-manandthewasp.htm">#1 Movie: 'Ant-Man and the Wasp'</a>
<a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>
<a href="http://bs.serving-sys.com/BurstingPipe/adServer.bs?cn=brd&amp;FlightID=22519975&amp;Page=&amp;PluID=0&amp;Pos=534097540" target="_blank">
<img border="0" height="90" src="http://bs.serving-sys.com/BurstingPipe/adServer.bs?cn=bsr&amp;FlightID=22519975&amp;Page=&amp;PluID=0&amp;Pos=534097540" width="728"/>
</a>
<a href="/"><img alt="Box Office Mojo" height="56" src="/img/misc/bom_logo1.png" width="245"/></a>
<a href="http://pro.imdb.com/signup/index.html?rf=mojo_nb_hm&amp;ref_=mojo_nb_hm" target="_blank">
<img alt="Get industry info at IMDbPro" height="20" src="/images/IMDbPro.png"/>
</a>
<a href="http://twitter.com/boxofficemojo" target="_blank">
<img alt="Follow us on Twitter" height="18" src="/images/glyphicons-social-32-t

## Additional functionality

BeautifulSoup has a special `Tag` object, which is what you get back when you run a `find`. It lets you read the element's attributes with dictionary-style indexing.

In [14]:
type(soup.find('a'))

bs4.element.Tag

In [15]:
# retrieve the url from an anchor tag (href attribute)
soup.find('a')['href']

'/daily/chart/'
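Indexing a tag with `['href']` raises a `KeyError` when the attribute is missing, which matters once you loop over many tags. The sketch below uses a small made-up HTML snippet (the variable names `html` and `soup_demo` are illustrative) and the built-in `"html.parser"` backend so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two real links and one anchor without an href
html = """
<body>
  <a href="/daily/chart/">Daily Box Office</a>
  <a href="/weekend/chart/">Weekend Box Office</a>
  <a name="top">Anchor without a link</a>
</body>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first_link = soup_demo.find("a")
print(first_link["href"])   # dictionary-style access; KeyError if missing
print(first_link.attrs)     # all attributes of the tag as a dict

# .get() returns None instead of raising when the attribute is absent
hrefs = [a.get("href") for a in soup_demo.find_all("a")]
print(hrefs)
```

Using `.get()` inside loops keeps one odd tag from stopping the whole scrape; the `None` entries can be filtered out afterwards.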

## Additional functionality
* In addition to matching on tags, an attribute like id or class can be matched
* Example: 'mp_box_content' classes

In [16]:
for element in soup.find_all(class_='mp_box_content'):
    print(element, '\n')

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$17,451,873</b></td>
<td align="right" width="25%">   <b>37.8%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $28,690,764</td>
<td align="right" width="25%">   62.2%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$46,142,637</b></td>
<td width="25%"> </td>
</tr>
</table>
</div> 

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td><td> $5,533,844</td></tr>
<tr>
<td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td></tr>
<tr>
<td align="right">% of Total Gross:</td><td> 31.7%</td></tr>
<tr><td align="right" colspan="2"><font fac

## Chaining methods

All the columns in each mp_box_content table can be found by "chaining" `find_all` twice. 

`find_all` returns a BeautifulSoup `ResultSet` object, which behaves much like a list. So `ll` below is a list of result sets (essentially a list of lists).

In [17]:
ll = [x.find_all('td') for x in soup.find_all(class_='mp_box_content')]
ll[0]

[<td width="40%"><b>Domestic:</b></td>,
 <td align="right" width="35%"> <b>$17,451,873</b></td>,
 <td align="right" width="25%">   <b>37.8%</b></td>,
 <td width="40%">+ Foreign:</td>,
 <td align="right" width="35%"> $28,690,764</td>,
 <td align="right" width="25%">   62.2%</td>,
 <td colspan="3" width="100%"><hr/></td>,
 <td width="40%">= <b>Worldwide:</b></td>,
 <td align="right" width="35%"> <b>$46,142,637</b></td>,
 <td width="25%"> </td>]

In [18]:
ll[1]

[<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td>,
 <td> $5,533,844</td>,
 <td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td>,
 <td align="right">% of Total Gross:</td>,
 <td> 31.7%</td>,
 <td align="right" colspan="2"><font face="Helvetica, Arial, Sans-Serif" size="1"><a href="/movies/?page=weekend&amp;id=biglebowski.htm"><b>&gt; View All 4 Weekends</b></a></font></td>,
 <td>Widest Release:</td>,
 <td> 1,235 theaters</td>]
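The same chaining pattern can be tried offline on a small stand-in for the page. The HTML below is a simplified, hypothetical imitation of the `mp_box_content` boxes (not the real page markup), parsed with the built-in `"html.parser"` backend.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for two mp_box_content boxes
box_html = """
<div class="mp_box_content">
  <table>
    <tr><td>Domestic:</td><td>$17,451,873</td></tr>
    <tr><td>Foreign:</td><td>$28,690,764</td></tr>
  </table>
</div>
<div class="mp_box_content">
  <table>
    <tr><td>Widest Release:</td><td>1,235 theaters</td></tr>
  </table>
</div>
"""
box_soup = BeautifulSoup(box_html, "html.parser")

# One list of cell strings per box (a list of lists)
cells_per_box = [
    [td.get_text(strip=True) for td in box.find_all("td")]
    for box in box_soup.find_all(class_="mp_box_content")
]
print(cells_per_box)

# Flatten into a single list when the grouping does not matter
all_cells = [text for cells in cells_per_box for text in cells]
print(all_cells)
```

Keeping the nested structure is useful when each box holds a different kind of information; flattening is simpler when you only need to search the cells for a label.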

## Chaining methods

To extract just the value of interest:

In [19]:
soup.find(class_='mp_box_content').find_all('td')[1].text[1:]

'$17,451,873'

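Be careful with non-printing characters! The `[1:]` slice above drops a leading whitespace character that is part of the cell text but invisible when printed.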
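A small sketch of how to spot and remove such characters, assuming the hidden character is a non-breaking space (`\xa0`, which is what `&nbsp;` becomes after parsing). The string `raw` is a hypothetical stand-in for the cell text.

```python
# Hypothetical cell text with a leading non-breaking space
raw = "\xa0$17,451,873"

print(repr(raw))        # repr() makes the hidden character visible
cleaned = raw.strip()   # str.strip() also removes Unicode whitespace such as \xa0
print(cleaned)
```

Calling `.strip()` is usually safer than slicing with `[1:]`, because it still works when the number of leading or trailing whitespace characters changes from page to page.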

In [20]:
# find with an "id". (ID is unique.)

print(soup.find(id='hp_footer'))

<div id="hp_footer">
<div style="padding-bottom: 20px;">
<div style="margin: 0px 121px; vertical-align: top;">
<div id="footer_links">
<ul class="footer_link_list">
<li><strong>Latest Updates</strong></li>
<li><a href="/news/?ref=ft">Movie News</a>
</li><li><a href="/daily/chart/?ref=ft">Daily Chart</a></li>
<li><a href="/weekend/chart/?ref=ft">Weekend Chart</a></li>
<li><a href="/alltime/?ref=ft">All Time Charts</a></li>
<li><a href="/intl/?ref=ft">International Charts</a></li>
</ul>
<!--
					<ul class="footer_link_list">
						<li><strong>Popular Movies</strong></li>
											</ul>
					-->
<ul class="footer_link_list">
<li><strong>Indices</strong></li>
<li><a href="/people/?ref=ft">People</a></li>
<li><a href="/genres/?ref=ft">Genres</a></li>
<li><a href="/franchises/?ref=ft">Franchises</a></li>
<li><a href="/showdowns/?ref=ft">Showdowns</a></li>
</ul>
<ul class="footer_link_list">
<li><strong>Other</strong></li>
<li><a href="/about/?ref=ft">About This Site</a></li>
<li><a href="

## Consistency

Web scraping is made simple/tractable by the consistent format of information among different pages of the same website. By the same token, scraping can be quite challenging or nearly impossible if pages vary too much in structure. 

## Items to scrape for each movie:
* movie title
* total domestic gross
* release date
* runtime
* rating


## Movie title

In [21]:
print(soup.find('title'))

<title>The Big Lebowski (1998) - Box Office Mojo</title>


In [22]:
title_string = soup.find('title').text
print(title_string)

The Big Lebowski (1998) - Box Office Mojo


In [23]:
print(title_string.split('('))

['The Big Lebowski ', '1998) - Box Office Mojo']


In [24]:
title = title_string.split('(')[0].strip()
print(title)

The Big Lebowski
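Splitting on `'('` works here, but it would cut a title short if the title itself contained parentheses. As a hedged alternative, the sketch below uses a regular expression that anchors on the four-digit year and the site suffix; the function name `split_title` and the second example title are hypothetical, and the pattern assumes every page title follows the `Title (YEAR) - Box Office Mojo` format seen above.

```python
import re

def split_title(title_string):
    """Return (title, year) from 'Some Title (1998) - Box Office Mojo'."""
    match = re.match(r"^(.*)\((\d{4})\)\s*-\s*Box Office Mojo\s*$", title_string)
    if not match:
        # Unexpected format: return the text unchanged and no year
        return title_string.strip(), None
    return match.group(1).strip(), int(match.group(2))

print(split_title("The Big Lebowski (1998) - Box Office Mojo"))
print(split_title("Frozen (Sing-Along) (2014) - Box Office Mojo"))
```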


## Domestic total gross

In [25]:
#we can match on the text in an element too
print(soup.find(text="Domestic Total Gross"))

None


The `text` argument does an exact-match search, so the search string must include the trailing colon and space!

In [26]:
print(soup.find(text="Domestic Total Gross: "))

Domestic Total Gross: 


## Regular expressions

Solution: [regular expressions](https://xkcd.com/208/) to the rescue

![](https://imgs.xkcd.com/comics/regular_expressions.png)

## Regular expressions

In [27]:
import re
domestic_total_regex = re.compile('Domestic Total')
soup.find(text=domestic_total_regex)

'Domestic Total Gross: '

In [28]:
dtg_string = soup.find(text=re.compile('Domestic Total'))
print(dtg_string)

Domestic Total Gross: 
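The difference between the exact match and the regex match can be reproduced with plain strings. When given a compiled pattern, BeautifulSoup filters with the pattern's `search()` method, so a match anywhere in the text is enough. The `label` string below is a stand-in for the text node on the page.

```python
import re

# Stand-in for the text node; note the colon and trailing space
label = "Domestic Total Gross: "

print(label == "Domestic Total Gross")        # exact comparison fails
pattern = re.compile(r"Domestic Total")
print(bool(pattern.search(label)))            # substring-style match succeeds

# A more tolerant pattern that ignores case and extra spacing
flexible = re.compile(r"domestic\s+total\s+gross", re.IGNORECASE)
print(bool(flexible.search("DOMESTIC  Total Gross:")))
```

A tolerant pattern helps when label text varies slightly between pages, at the cost of possibly matching more than you intended, so it is worth checking what each pattern actually hits.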


In [29]:
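# findNext() returns the next tag in document order;
# findNextSibling() (used below) only looks at siblings of this text node.
dtg_string.findNext()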

<b>$17,451,873</b>

In [30]:
dtg = dtg_string.findNextSibling().text
dtg = dtg.replace('$','').replace(',','')

domestic_total_gross = int(dtg)
print(domestic_total_gross)

17451873


## Scaling up

Make a function to scrape multiple things:

In [31]:
def get_movie_value(soup, field_name):
    '''Grab a value from boxofficemojo HTML
    
    Takes a string attribute of a movie on the page and
    returns the string in the next sibling object
    (the value for that attribute)
    or None if nothing is found.
    '''
    obj = soup.find(text=re.compile(field_name))
    if not obj: 
        return None
    # this works for most of the values
    next_sibling = obj.findNextSibling()
    if next_sibling:
        return next_sibling.text 
    else:
        return None

## Scaling up

In [32]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

$17,451,873


In [33]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)

1 hrs. 57 min.


In [34]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

R


In [35]:
a = re.compile('text')

In [36]:
a.pattern

'text'

In [37]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

March 6, 1998


## Parsing the data

In [38]:
import dateutil.parser

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

def runtime_to_minutes(runtimestring):
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (ValueError, IndexError):  # unexpected format, e.g. hours missing
        return None
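The positional approach in `runtime_to_minutes` assumes the string always has the shape `'H hrs. M min.'`. As a sketch of a more forgiving variant, the function below looks for the hour and minute parts separately with regular expressions. The name `runtime_to_minutes_safe` and the alternative input formats are assumptions for illustration; the only format confirmed by the page above is `'1 hrs. 57 min.'`.

```python
import re

def runtime_to_minutes_safe(runtimestring):
    """Convert strings like '1 hrs. 57 min.' to minutes; None if unparseable."""
    if not runtimestring:
        return None
    hours = re.search(r"(\d+)\s*hr", runtimestring)
    minutes = re.search(r"(\d+)\s*min", runtimestring)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

for sample in ["1 hrs. 57 min.", "45 min.", "2 hrs.", "N/A", None]:
    print(repr(sample), "->", runtime_to_minutes_safe(sample))
```

Returning `None` for anything unrecognized keeps the scraper running and leaves a clear gap in the data that can be inspected later.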

## Parsing the data

In [39]:
# Let's get these again and format them all in one swoop
# This is some nice python code: everything is essentially 
# English and it's very explicit what all the functions are doing

from pprint import pprint

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

runtime

117

## Formatting the parsed data

When you scrape data for individual pages, you'll likely want to pull it all together into a standardized format like a dataframe. One of the easiest ways to do this is to **collect the data as a list of dictionaries**, where each dictionary has keys corresponding to all of the fields you're scraping data for.

Once you have the list of dictionaries, you can just call `pd.DataFrame(dict_list)` like below, and the dictionary keys will get converted to columns with rows for all the different dictionaries.

In [46]:
import pandas as pd

# list of dictionaries is a very nice parsing format: pandas
# can automatically convert it to a dataframe
headers = ['movie title', 'domestic total gross',
           'release date', 'runtime (mins)', 'rating']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                release_date,
                                runtime,
                                rating]))
movie_data.append(movie_dict)

pprint(movie_data)

pd.DataFrame(movie_data)

[{'domestic total gross': 17451873,
  'movie title': 'The Big Lebowski',
  'rating': 'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0),
  'runtime (mins)': 117}]


| | domestic total gross | movie title | rating | release date | runtime (mins) |
|---|---|---|---|---|---|
| 0 | 17451873 | The Big Lebowski | R | 1998-03-06 | 117 |


## Recap: Scraping Workflow / Code Setup

Here's a short summary of how you would likely develop and structure scraping code:

1. Starting with one page, write parsing functions that should generalize to any page with the same format.
2. Test your code on several different pages to confirm that it works in general. Make it robust to missing elements by returning `None` when something is not found (as in `get_movie_value` above) or by using `try`/`except` clauses (as in `runtime_to_minutes`).
3. Figure out the total collection of webpages that you want to scrape, and collect the URLs into a list (e.g. 2018 movies 1-100, 101-200, etc.). Iterating through all the URLs, request and parse each page, adding the data to a list of dicts.
4. Incrementally convert the list of dicts into a dataframe and save it to disk with `DataFrame.to_csv()`.

Note that this workflow leaves out some details you may need like intentional pausing, but we'll get to these soon :) 

Also, this is a suggestion, not a prescription. You should think of Project Luther as a first foray into setting up a data acquisition pipeline, and it takes some intelligent and creative design to get it right. Our best advice is to start scraping early and see where things break quickly so that you can fix them!
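One possible shape for that workflow is sketched below. The function names (`scrape_all`, `rows_to_csv`) and the fake fetch/parse helpers are hypothetical; the fakes exist only so the sketch runs offline. In real use, `fetch` might be something like `lambda url: requests.get(url).text` and `parse` would wrap the BeautifulSoup functions written earlier.

```python
import csv
import io
import time

def scrape_all(urls, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each URL, pausing between requests; skip failures."""
    rows = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)   # pause between requests to be polite
        try:
            rows.append(parse(fetch(url)))
        except Exception as err:        # keep going if one page is broken
            print("skipping {}: {}".format(url, err))
    return rows

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text (write it to a file to save it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Stand-ins so the sketch runs without network access
fake_pages = {"page-a": "The Big Lebowski|117", "page-b": "broken"}

def fake_fetch(url):
    return fake_pages[url]

def fake_parse(page):
    title, runtime = page.split("|")    # raises ValueError on the broken page
    return {"movie title": title, "runtime (mins)": int(runtime)}

rows = scrape_all(["page-a", "page-b"], fake_fetch, fake_parse, delay_seconds=0)
csv_text = rows_to_csv(rows, ["movie title", "runtime (mins)"])
print(csv_text)
```

Passing `fetch` and `parse` in as arguments keeps the loop testable on saved or fake pages before it is pointed at the live site.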

## Bonus: Scraping tables

In [41]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page, "lxml")

In [42]:
tables = soup.find_all("table")

rows = tables[3].find_all('tr')

# Skip the header row and keep the next 19 rows for now
rows = rows[1:20]

movies = {}
for row in rows:
    items = row.find_all('td')
    # use the hyperlink as the dict key; named `link` so it does not
    # overwrite the `title` variable defined earlier
    link = items[1].find('a')['href']
    movies[link] = [i.text for i in items[1:]]


list(movies.items())[1]

('/movies/?id=lifeisbeautiful.htm',
 ['Life Is Beautiful(Italy)',
  'Mira.',
  '$57,563,264',
  '1,136',
  '$118,920',
  '6',
  '10/23/98'])
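The row-by-row pattern above can also produce a list of dictionaries, which fits the format recommended earlier. The sketch below uses a tiny made-up table (the movie names, links, and column choices are hypothetical) and the built-in `"html.parser"` backend so it runs offline.

```python
from bs4 import BeautifulSoup

# Made-up table with a header row and two data rows
table_html = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Gross</th></tr>
  <tr><td>1</td><td><a href="/movies/?id=a.htm">Movie A</a></td><td>$100</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=b.htm">Movie B</a></td><td>$50</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

records = []
for row in table_soup.find("table").find_all("tr")[1:]:   # [1:] skips the header
    cells = row.find_all("td")
    records.append({
        "rank": int(cells[0].get_text(strip=True)),
        "title": cells[1].get_text(strip=True),
        "link": cells[1].find("a")["href"],
        "gross": cells[2].get_text(strip=True),
    })
print(records)
```

Named keys make each record self-describing, so later code does not depend on remembering which list position holds which column.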

In [44]:
url = 'http://www.boxofficemojo.com/genres/chart/?view=main&sort=gross&order=DESC&pagenum=21&id=foreign.htm'

tables = pd.read_html(url)

In [45]:
tables[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,915,916,917,918,919,920,921,922,923,924
0,Foreign Language 1980-PresentOnly overseas-pro...,Rank,Title (click to view),Studio,Lifetime Gross / Theaters,Opening / Theaters,Date,2002,Paradise: Faith,Strand,...,TBD,Kobiety Mafii,,TBD,Raazi,,TBD,Kaala,,TBD
1,Rank,Title (click to view),Studio,Lifetime Gross / Theaters,Opening / Theaters,Date,,,,,...,,,,,,,,,,
2,2002,Paradise: Faith,Strand,"$6,508",3,"$2,179",3,8/23/13,,,...,,,,,,,,,,
3,2003,"Magdalena, The Unholy Saint",Unco,"$6,470",3,"$6,470",3,3/11/05,,,...,,,,,,,,,,
4,2004,Ek Vivah Aisa Bhi(India),Eros,"$6,437",7,"$6,437",7,11/7/08,,,...,,,,,,,,,,
5,2005,Chillar Party,UTV,"$6,330",7,"$4,155",7,7/8/11,,,...,,,,,,,,,,
6,2006,My Joy,KL,"$6,298",1,"$2,077",1,9/30/11,,,...,,,,,,,,,,
7,2007,Manuscripts Don't Burn,KL,"$6,295",2,,-,7/25/14,,,...,,,,,,,,,,
8,2008,All My Loved Ones(Czech Republic),N.Arts,"$6,237",1,"$3,222",1,8/16/02,,,...,,,,,,,,,,
9,2009,The Red Chapel,Lorb.,"$6,196",2,"$2,459",1,12/29/10,,,...,,,,,,,,,,
