In [2]:
# Python 2 & 3 Compatibility
from __future__ import print_function, division

# Chapter 1: Parse HTML with BeautifulSoup

Summary of this notebook:
    We'll study the page we fetched, using BeautifulSoup to parse its structured HTML and extract the names of the 200 top-grossing movies along with the links to their individual pages.
    Then we'll save everything to a flat file.

```
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help.
```

**Beautiful Soup:**
* sits atop an HTML or XML parser,
* provides Pythonic idioms for iterating, searching, and modifying the parse tree,
* handles text encodings automatically (incoming documents are converted to Unicode, outgoing documents to UTF-8).

Beautiful Soup [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
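
As a minimal sketch of the three points above, the snippet below parses a small hand-written document. The markup and the variable names (`html`, `demo`, `paragraph`) are made up for illustration and are not taken from the Box Office Mojo page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, deliberately sloppy: the <p> tag is never closed.
html = b"<html><body><p class='note'>Caf\xc3\xa9 <b>menu</b></body></html>"

# Beautiful Soup sits on top of a parser; here Python's built-in 'html.parser'.
demo = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')

# Pythonic access: attribute-style navigation and dict-style attributes.
paragraph = demo.p
print(paragraph['class'])      # class is multi-valued, so this is a list
print(paragraph.get_text())    # the text has already been decoded to Unicode

# Modifying the parse tree is plain assignment.
paragraph.b.string = 'prices'
print(demo.encode('utf-8'))    # bytes out, encoded as UTF-8
```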

In [3]:
import requests

url = 'http://www.boxofficemojo.com/alltime/adjusted.htm'
response = requests.get(url)

In [4]:
response

<Response [200]>

In [10]:
# if needed: !conda install beautifulsoup4 -y

In [5]:
from bs4 import BeautifulSoup

In [6]:
soup = BeautifulSoup(response.text, 'html.parser')

## Great! Now we have a soup object, so we can use all of Beautiful Soup's tools to parse the HTML it contains

## Find elements on the page
Beautiful Soup defines many methods for searching the parse tree, and they all work in much the same way. The two most popular are `find()` and `find_all()`:
* `soup.find()` returns the first match
* `soup.find_all()` returns every match; it is the most popular method, so calling the soup directly, as in `soup('a')`, is a shortcut for it

The two methods take almost exactly the same arguments.

Let's try out some common variations of `soup.find()`
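
Before turning to the real page, here is a minimal sketch of how the two methods differ, run on a tiny hand-written snippet. The markup and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
demo = BeautifulSoup(html, 'html.parser')

first = demo.find('a')        # a single Tag, or None when nothing matches
every = demo.find_all('a')    # a list-like ResultSet, possibly empty
shortcut = demo('a')          # calling a soup (or a tag) is find_all()

print(first['href'])
print([a['href'] for a in every])
print(demo.find('table'))       # None, not an exception
print(demo.find_all('table'))   # an empty list
```

Because `find()` returns `None` on a miss, indexing its result (for example `demo.find('table')['id']`) raises a `TypeError`, so it is worth checking for `None` on pages you do not control.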

## `soup.find()`

In [7]:
# soup.find() returns the first matched tag it finds.
# It searches the entire tree.

# Search for a type of tag by using the tag as a string
# (like 'body','div','p','a') as an argument.

print(soup.find('a'))

# Equivalently:
print(soup.a)

<a href="/daily/chart/">Daily Box Office (Tue.)</a>
<a href="/daily/chart/">Daily Box Office (Tue.)</a>


In [8]:
# Let's try again, this time extracting the page title
soup.find('title')

<title>All Time Box Office Adjusted for Ticket Price Inflation</title>

In [9]:
# retrieve the url from an anchor tag
soup.find('a')['href']

'/daily/chart/'

In [34]:
# You can match on an attribute like an id or class.
soup.find(class_="nl_link")

<li class="nl_link">
<a href="http://pro.imdb.com/signup/index.html?rf=mojo_nb_hm&amp;ref_=mojo_nb_hm" target="_blank">
<img alt="Get industry info at IMDbPro" height="20" src="/images/IMDbPro.png">
</img></a>
</li>

In [32]:
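# id=True matches the first tag that has any id attribute.
# To find one specific element, pass its id value instead,
# e.g. soup.find(id='sis_pixel_sitewide'). (An id should be unique on a page.)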
soup.find(id=True)

<iframe frameborder="0" height="1" id="sis_pixel_sitewide" marginheight="0" marginwidth="0" style="display: none;" width="1"></iframe>
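
The attribute filters shown so far can be compared side by side on a small example. This is only a sketch: the markup, ids, and class names below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div id="header">Top</div>
<div class="row odd">One</div>
<div class="row">Two</div>
<div id="footer">Bottom</div>
"""
demo = BeautifulSoup(html, 'html.parser')

first_with_id = demo.find(id=True)        # first tag that has any id
footer = demo.find(id='footer')           # the tag with this exact id
rows = demo.find_all(class_='row')        # class_ matches any one of a tag's classes
odd = demo.find('div', attrs={'class': 'odd'})  # dict form of the same filter

print(first_with_id['id'], footer.string, len(rows), odd.string)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.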

## `soup.find_all()` works just like `soup.find()`, but returns a list of all matches. (`soup.find()` returns only the first match and then stops searching.)

In [35]:
soup.find_all('title')

[<title>All Time Box Office Adjusted for Ticket Price Inflation</title>]

In [33]:
# You can search for all links whose href matches a pattern. Let's find all the Transformers movies.
import re
soup.find_all('a', href=re.compile(r'transformer'))

[<a href="/movies/?id=transformers5.htm">#1 Movie: 'Transformers 5'</a>,
 <a href="/movies/?id=transformers2.htm"><b>Transformers: Revenge of the Fallen</b></a>,
 <a href="/movies/?id=transformers06.htm"><b>Transformers</b></a>,
 <a href="/movies/?id=transformers3.htm"><b>Transformers: Dark of the Moon</b></a>]

In [40]:
# How many times does "Box Office" appear?
# Count the text strings that contain the phrase "Box Office"
len(soup.find_all(string=re.compile('Box Office')))

6

In [39]:
soup.find_all(string=re.compile('Box Office'))

['All Time Box Office Adjusted for Ticket Price Inflation',
 'Daily Box Office (Tue.)',
 'Weekend Box Office (Jun. 23–25)',
 'Box Office',
 'All Time Box Office',
 ', Inc. or its affiliates. All rights reserved. Box Office Mojo and IMDb are trademarks or registered trademarks of IMDb.com, Inc. or its affiliates. ']
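
Searching with `string=` returns the matching text nodes rather than the tags that hold them. The sketch below, on invented markup, shows two ways to get back to the tags: following `.parent`, or combining a tag name with `string=`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = "<p>Daily Box Office</p><a href='/w'>Weekend Box Office</a><p>Other</p>"
demo = BeautifulSoup(html, 'html.parser')

# Text nodes whose content matches the pattern.
matches = demo.find_all(string=re.compile('Box Office'))
for text in matches:
    print(text.parent.name, repr(str(text)))   # the enclosing tag and the text

# Restrict to <a> tags whose .string matches the pattern.
links = demo.find_all('a', string=re.compile('Box Office'))
print(links[0]['href'])
```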

In [42]:
# find all the tables
# This is equivalent to soup('table'), the shortcut for find_all()
soup.find_all('table')

[<table border="0" cellpadding="0" cellspacing="0">
 <tr>
 <form action="/adjuster.php" method="POST" name="adjuster">
 <input name="returnURL" type="hidden" value="/alltime/adjusted.htm"/>
 <td valign="center">
 <font face="Verdana" size="2"><a href="/about/adjuster.htm"><b>Adjuster:</b></a></font>
 <select name="ticketyr" size="1" style="font-family: Verdana; font-size: 10pt">
 <option value="0">Actuals</option>
 <option value="1">Est. Tckts</option>
 <script language="javascript">
   for(i=2017; i>=1933; i--) {
   	document.write('<option value="' + i + '"');
 	if(i=='2017') document.write(' selected');
 	document.write('>' + i );
 	if(i=='2017') document.write(', $' + '8.84');
 	document.write('</option>');
   }
 </script>
 <option value="1929">1929</option>
 <option value="1924">1924</option>
 <option value="1910">1910</option>
 </select><input name="Go" style="font-size: 10pt; height: 22" type="submit" value="Go"/>
 </td></form></tr></table>,
 <table border="0" cellpadding="0" ce

In [43]:
len(soup('table'))

4

In [38]:
# using the shorthand here:
# soup('table') is equivalent to soup.find_all('table')
soup('table')[3]

<table border="0" cellpadding="5" cellspacing="1" width="95%"><tr bgcolor="#dcdcdc"><td align="center"><font size="2"><a href="/alltime/adjusted.htm?sort=rank&amp;order=ASC&amp;adjust_yr=2017&amp;p=.htm">Rank</a></font></td><td align="center"><font size="2"><a href="/alltime/adjusted.htm?sort=title&amp;order=ASC&amp;adjust_yr=2017&amp;p=.htm">Title (click to view)</a></font></td><td align="center"><font size="2"><a href="/alltime/adjusted.htm?sort=studio&amp;order=ASC&amp;adjust_yr=2017&amp;p=.htm">Studio</a></font></td><td align="center"><font size="2"><a href="/alltime/adjusted.htm?sort=adjustedgross&amp;order=ASC&amp;adjust_yr=2017&amp;p=.htm"><b>Adjusted Gross</b></a></font></td><td align="center"><font size="2"><a href="/alltime/adjusted.htm?sort=gross&amp;order=DESC&amp;adjust_yr=2017&amp;p=.htm">Unadjusted Gross</a></font></td><td align="center"><font size="2"><a href="/alltime/adjusted.htm?sort=year&amp;order=DESC&amp;adjust_yr=2017&amp;p=.htm">Year^</a></font></td></tr><tr bgc

## Chaining syntax: 
You can chain `.find_all()` or `.find()` commands together

In [45]:
# I want the individual page link for the #1 movie, Gone with the Wind. How do I get it?

(
    soup.find_all('table')[3]  # find the all-time top-grossing movie table
    .find_all('tr')[1]         # fetch the second table row (skip the header row)
    .find_all('td')[1]         # fetch the second table column
)

<td><font size="2"><a href="/movies/?id=gonewiththewind.htm"><b>Gone with the Wind</b></a></font></td>

In [46]:
# equivalently, I could use this shorthand form
gone = soup('table')[3]('tr')[1]('td')[1]

In [47]:
# Great, I have the table cell element that contains the individual page URL
gone

<td><font size="2"><a href="/movies/?id=gonewiththewind.htm"><b>Gone with the Wind</b></a></font></td>

In [45]:
# Extract the href information from the link
# long form: gone.find('a')['href']
gone.a['href']

'/movies/?id=gonewiththewind.htm'
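
Chained lookups like these assume every step finds something. On a page whose layout may change, a guarded version can be safer. The helper below is a hypothetical sketch (the name `second_cell_link` and the sample table are invented here), not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table for illustration only.
html = """
<table><tr><td>Rank</td><td>Title</td></tr>
<tr><td>1</td><td><a href="/movies/?id=example.htm">Example</a></td></tr></table>
"""
demo = BeautifulSoup(html, 'html.parser')

def second_cell_link(table, row_index):
    """Return the href found in the second cell of a row, or None if absent."""
    rows = table.find_all('tr')
    if row_index >= len(rows):
        return None                      # the row does not exist
    cells = rows[row_index].find_all('td')
    if len(cells) < 2 or cells[1].a is None:
        return None                      # no second cell, or no link inside it
    return cells[1].a.get('href')

print(second_cell_link(demo.table, 1))   # data row with a link
print(second_cell_link(demo.table, 0))   # header row: no link
print(second_cell_link(demo.table, 5))   # row index out of range
```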

In [48]:
# Extract the movie name from the .string attribute
gone.string

'Gone with the Wind'
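
`.string` works here because each tag in the cell has exactly one child, all the way down to the text. When a tag holds several children, `.string` is `None` and `.get_text()` is the safer choice. A minimal sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical cells for illustration only.
html = ("<td><a href='/x'><b>Gone with the Wind</b></a></td>"
        "<td><b>Star</b> Wars</td>")
demo = BeautifulSoup(html, 'html.parser')
single, mixed = demo.find_all('td')

print(single.string)      # one chain of single children, so the text is found
print(mixed.string)       # None: this cell has two children
print(mixed.get_text())   # concatenates all the text inside the cell
```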

## Great, now we know how to extract the page links. Let's do that for all 200 movies

In [71]:
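# parse the links of all 200 top movies on this page
movie_urls = []
for row in soup('table')[3]('tr')[1:]:   # skip the header row
    movie = row('td')[1]                 # the second column holds the title link
    m_name = movie.string                # movie title (extracted, but only the link is stored)
    m_link = movie.a['href']
    print(m_link)
    movie_urls.append(m_link)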

/movies/?id=gonewiththewind.htm
/movies/?id=starwars4.htm
/movies/?id=soundofmusic.htm
/movies/?id=et.htm
/movies/?id=titanic.htm
/movies/?id=tencommandments.htm
/movies/?id=jaws.htm
/movies/?id=doctorzhivago.htm
/movies/?id=exorcist.htm
/movies/?id=snowwhite.htm
/movies/?id=starwars7.htm
/movies/?id=101dalmations.htm
/movies/?id=starwars5.htm
/movies/?id=benhur.htm
/movies/?id=avatar.htm
/movies/?id=starwars6.htm
/movies/?id=jurassicpark.htm
/movies/?id=starwars.htm
/movies/?id=lionking.htm
/movies/?id=sting.htm
/movies/?id=raidersofthelostark.htm
/movies/?id=graduate.htm
/movies/?id=fantasia.htm
/movies/?id=jurassicpark4.htm
/movies/?id=godfather.htm
/movies/?id=forrestgump.htm
/movies/?id=marypoppins.htm
/movies/?id=grease.htm
/movies/?id=avengers11.htm
/movies/?id=thunderball.htm
/movies/?id=darkknight.htm
/movies/?id=junglebook.htm
/movies/?id=sleepingbeauty.htm
/movies/?id=ghostbusters.htm
/movies/?id=shrek2.htm
/movies/?id=butchcassidyandthesundancekid.htm
/movies/?id=lovestory.

In [72]:
len(movie_urls)

200
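
The collected links are relative to the site root, so they need to be joined with the page URL before they can be requested. A minimal sketch using the standard library (Python 3 import shown; the two sample links are copied from the output above):

```python
from urllib.parse import urljoin

base_url = 'http://www.boxofficemojo.com/alltime/adjusted.htm'
relative_links = ['/movies/?id=gonewiththewind.htm', '/movies/?id=starwars4.htm']

# urljoin resolves each relative link against the page it was found on.
absolute_links = [urljoin(base_url, link) for link in relative_links]
for link in absolute_links:
    print(link)
```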

# Exercise: 

I noticed that all the movie URLs have the form `/movies/?id=some_movie_name.htm`. Maybe there is an easier way to scrape them all without walking the table rows.

In [None]:
# Can you get all the URLs by matching this pattern in the link's href attribute?
# hint: use a regular expression, and don't forget to escape special characters

# \? matches one literal question mark; the raw string keeps the backslash intact
all_links = soup.find_all('a', href=re.compile(r'/movies/\?id='))
for link in all_links:
    print(link['href'])

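## This is much easier! Consistent formatting always makes scraping easier!
Note that the pattern matches every such link on the page, including ones outside the ranking table (for example the "#1 Movie" link to `transformers5.htm` found earlier), so check the result before relying on it.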
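
A quick check with the standard `re` module shows why the escaping hint matters: `\?` matches exactly one literal question mark, while `\?*` matches any number of them, including none. The sample hrefs below are made up for illustration.

```python
import re

loose = re.compile(r'/movies/\?*id=')    # '?' may appear zero or more times
strict = re.compile(r'/movies/\?id=')    # exactly one literal '?'

samples = [
    '/movies/?id=jaws.htm',              # the real link format
    '/movies/id=jaws.htm',               # hypothetical: no question mark
    '/movies/??id=jaws.htm',             # hypothetical: two question marks
    '/alltime/adjusted.htm?sort=rank',   # a different kind of link
]

results = [(bool(loose.search(h)), bool(strict.search(h))) for h in samples]
for href, (is_loose, is_strict) in zip(samples, results):
    print(href, is_loose, is_strict)
```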

## Great! Now we have the links for all 200 all-time top-grossing movies!
Let's save them to a file so we can visit each movie page later and extract the information we care about.

In [None]:
# if needed: !conda install pandas -y

In [76]:
import pandas as pd

In [79]:
df = pd.Series(movie_urls)

In [81]:
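# header=False keeps the file to one URL per line (newer pandas versions write a header row by default)
df.to_csv('../data/movie_urls.txt', index=False, header=False)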

In [None]:
%load '../data/movie_urls.txt'
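
If pandas is not available, the same one-URL-per-line file can be written and read back with plain Python. This is a minimal sketch: the temporary directory and the two-item sample list are stand-ins for `../data/movie_urls.txt` and `movie_urls`.

```python
import os
import tempfile

# Stand-in data and path, used only for this illustration.
movie_urls_demo = ['/movies/?id=gonewiththewind.htm', '/movies/?id=starwars4.htm']
path = os.path.join(tempfile.mkdtemp(), 'movie_urls.txt')

# Write one URL per line.
with open(path, 'w') as handle:
    handle.write('\n'.join(movie_urls_demo) + '\n')

# Read the URLs back, skipping any blank lines.
with open(path) as handle:
    loaded = [line.strip() for line in handle if line.strip()]

print(loaded)
```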