## Web Scraping 1: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

This lesson covers scraping data from web pages.

### First, an HTML refresher

HTML is the basic language used to create a web page. 

It tells the web browser what text and media to display, where to display it, and how to display it (style).

HTML is highly structured and hierarchical.

Every page is made up of discrete "elements."

Elements are labeled with "tags."

For example:

    <p>You are beginning to learn HTML.</p>

A start tag often also contains "attributes" that carry information about the element.

An attribute usually has a name and a value.

Example:

    <p class="my_red_sentences">You are beginning to learn HTML.</p>

A full HTML document has a structure more like this:

```
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
```
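
To make the terms concrete, here is a minimal sketch that walks a document shaped like the one above and lists each element's tag and attributes. It uses only Python's standard-library `html.parser` (Python 3 is assumed); the `TagLister` class and `SAMPLE_HTML` name are made up for this illustration.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record every start tag and its attributes as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag_name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)

for tag, attrs in lister.seen:
    print(tag, attrs)
```

Tags without attributes show an empty dictionary; the `p` and `a` elements show their `class` and `href` values.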

Let's explore some live HTML!

Open http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser,
then try both Inspect Element and View Page Source.

### Get the HTML from a page and convert to a BeautifulSoup object

We'll start by scraping some of that information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [2]:
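# if needed: pip install requests
import requests

# The studio market share page on Box Office Mojo
url = 'http://www.boxofficemojo.com/studio/'

# Send an HTTP GET request; the response holds the status code and page body
response = requests.get(url)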

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [3]:
response.status_code

200
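
A status code of 200 means the request succeeded. As a rough sketch of how codes map to meanings, the snippet below uses the standard-library `http.HTTPStatus` enum (Python 3 is assumed); the helper names `describe_status` and `is_success` are made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return "%d (not a standard status code)" % code
    return "%d %s" % (status.value, status.phrase)


def is_success(code):
    """2xx codes mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "-> success" if is_success(code) else "-> failure")
```

With `requests`, calling `response.raise_for_status()` is another way to stop early, since it raises an exception for 4xx and 5xx responses.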

In [4]:
print(response.text)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<HEAD>
<TITLE>2015 Market Share and Box Office Results by Movie Studio</TITLE>

<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="screen" title="no title" charset="utf-8">
<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="print" title="no title" charset="utf-8"></head>
<body >
<iframe id="sis_pixel_sitewide" width="1" height="1" frameborder="0" marginwidth="0" marginheight="0" style="display: none;"></iframe>
<script>
    setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                params_sis3 = [
                    "d=boxofficemojo.com",
                    "cb=" + cacheBust
                ];

            (document.getElementById('sis_pixel_sitewide')).src = url_sis3 + params_sis3.join

In [5]:
page = response.text

In [6]:
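# if needed: pip install beautifulsoup4

# BeautifulSoup takes an HTML document in the form of a string.
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids a warning about the default choice.
soup = BeautifulSoup(page, 'html.parser')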

In [7]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<title>2015 Market Share and Box Office Results by Movie Studio</title>
<link charset="utf-8" href="/css/mojo.css?1" media="screen" rel="stylesheet" title="no title" type="text/css"/>
<link charset="utf-8" href="/css/mojo.css?1" media="print" rel="stylesheet" title="no title" type="text/css"/></head>
<body>
<iframe frameborder="0" height="1" id="sis_pixel_sitewide" marginheight="0" marginwidth="0" style="display: none;" width="1"></iframe>
<script>
    setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                params_sis3 = [
                    "d=boxofficemojo.com",
                    "cb=" + cacheBust
                ];

            (document.getElementById('sis_pixel_sitewide')).src = url_sis3 + params_sis3.join

In [8]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <title>
   2015 Market Share and Box Office Results by Movie Studio
  </title>
  <link charset="utf-8" href="/css/mojo.css?1" media="screen" rel="stylesheet" title="no title" type="text/css"/>
  <link charset="utf-8" href="/css/mojo.css?1" media="print" rel="stylesheet" title="no title" type="text/css"/>
 </head>
 <body>
  <iframe frameborder="0" height="1" id="sis_pixel_sitewide" marginheight="0" marginwidth="0" style="display: none;" width="1">
  </iframe>
  <script>
   setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                params_sis3 = [
                    "d=boxofficemojo.com",
                    "cb=" + cacheBust
                ];

            (document.getElementById('sis_pixel_sitewide')).src = url_

### `soup.find()`

`soup.find()` is the method we will use most often from this package.

Let's try out some common variations of `soup.find()`.

In [9]:
# soup.find() returns the first matching tag it finds.
# It searches the entire tree.

# Search for a type of tag by passing the tag name as a string
# (like 'body', 'div', 'p', 'a') as an argument.

print(soup.find('table'))

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr><td bgcolor="#8b0000" colspan="2"><img border="0" height="5" src="/images/space.gif" width="1"/></td></tr>
</table>


In [10]:
# Equivalently:
print(soup.table)

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr><td bgcolor="#8b0000" colspan="2"><img border="0" height="5" src="/images/space.gif" width="1"/></td></tr>
</table>


In [11]:
# Prettier:
print(soup.table.prettify())

<table border="0" cellpadding="0" cellspacing="0" width="100%">
 <tr>
  <td bgcolor="#8b0000" colspan="2">
   <img border="0" height="5" src="/images/space.gif" width="1"/>
  </td>
 </tr>
</table>
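
Beyond searching by tag name, `find()` can also filter on attributes. The sketch below shows a few common variations on a small, made-up HTML snippet (the variable `demo` and all of the HTML content are hypothetical, chosen only for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration.
html = """
<div id="main">
  <p class="intro">First paragraph.</p>
  <p class="red">Second paragraph.</p>
  <a href="http://example.com/page">A link</a>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# By tag name: the first <p> in the document.
first_p = demo.find("p")
print(first_p.text)

# By tag name and CSS class (class_ has a trailing underscore
# because class is a reserved word in Python).
red_p = demo.find("p", class_="red")
print(red_p.text)

# By an arbitrary attribute, here the id.
main_div = demo.find(id="main")
print(main_div.name)

# Attributes of a tag are read like dictionary entries.
link = demo.find("a")
print(link["href"])

# No match returns None rather than raising an error.
print(demo.find("table"))
```

Because a failed search returns `None`, it is worth checking the result before reading `.text` or an attribute from it.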



In [12]:
# soup.find_all() returns a list of all matches

# The studio data is in the fourth table on the page (index 3)
table = soup.find_all('table')[3]
rows = table.find_all('tr')

all_rows = []
for row in rows[1:]:  # skip the first row
    cells = [cell.text.strip() for cell in row.find_all('td')]
    # Keep only complete data rows, which have six cells
    if len(cells) == 6:
        all_rows.append(cells)

print(all_rows)

[[u'1', u'Universal', u'27.6%', u'$2,259.0', u'16', u'14'], [u'2', u'Buena Vista', u'17.6%', u'$1,442.0', u'12', u'8'], [u'3', u'Warner Bros.', u'16.2%', u'$1,324.5', u'27', u'20'], [u'4', u'20th Century Fox', u'10.1%', u'$828.8', u'18', u'11'], [u'5', u'Paramount', u'6.9%', u'$562.1', u'10', u'6'], [u'6', u'Sony / Columbia', u'5.8%', u'$477.8', u'13', u'9'], [u'7', u'Lionsgate', u'3.6%', u'$293.5', u'14', u'12'], [u'8', u'Weinstein Company', u'3.3%', u'$269.0', u'9', u'6'], [u'9', u'Focus Features', u'1.3%', u'$108.2', u'10', u'8'], [u'10', u'Fox Searchlight', u'1.2%', u'$95.4', u'7', u'5'], [u'11', u'Relativity', u'0.9%', u'$74.2', u'6', u'4'], [u'12', u'Sony Classics', u'0.7%', u'$56.3', u'19', u'15'], [u'13', u'A24', u'0.6%', u'$50.3', u'10', u'8'], [u'14', u'STX Entertainment', u'0.5%', u'$43.7', u'1', u'1'], [u'15', u'Open Road Films', u'0.5%', u'$41.2', u'6', u'4'], [u'16', u'Roadside Attractions', u'0.4%', u'$33.3', u'10', u'8'], [u'17', u'Broad Green Pictures', u'0.4%', u'$30.
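
The same row-by-row pattern can be tried on a small table without fetching a page. This is a minimal sketch on made-up HTML (the table contents and the names `demo` and `parsed_rows` are hypothetical); it shows why checking the number of cells filters out rows that are not data.

```python
from bs4 import BeautifulSoup

# Hypothetical table, made up for this illustration.
html = """
<table>
  <tr><th>Rank</th><th>Studio</th><th>Share</th></tr>
  <tr><td>1</td><td> Studio A </td><td>60%</td></tr>
  <tr><td colspan="3">An advertisement row</td></tr>
  <tr><td>2</td><td>Studio B</td><td>40%</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

parsed_rows = []
for tr in demo.find("table").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    # The header row has <th> cells (so no <td>), and the ad row has
    # only one cell; keeping rows of the expected width drops both.
    if len(cells) == 3:
        parsed_rows.append(cells)

print(parsed_rows)
```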

In [13]:
import pandas as pd
studios = pd.DataFrame(all_rows)
studios.columns = ['Rank', 'Distributor', 'Market Share', 'Total Gross', 'Movies Tracked', '2015 Movies']

In [14]:
studios.head()

|   | Rank | Distributor | Market Share | Total Gross | Movies Tracked | 2015 Movies |
|---|---|---|---|---|---|---|
| 0 | 1 | Universal | 27.6% | $2,259.0 | 16 | 14 |
| 1 | 2 | Buena Vista | 17.6% | $1,442.0 | 12 | 8 |
| 2 | 3 | Warner Bros. | 16.2% | $1,324.5 | 27 | 20 |
| 3 | 4 | 20th Century Fox | 10.1% | $828.8 | 18 | 11 |
| 4 | 5 | Paramount | 6.9% | $562.1 | 10 | 6 |


In [15]:
s1 = studios.drop('Total Gross', axis=1)

In [16]:
s2 = s1.drop('Movies Tracked', axis=1)

In [17]:
studio_market_share = s2.drop('2015 Movies', axis=1)

In [18]:
studio_market_share.head()

|   | Rank | Distributor | Market Share |
|---|---|---|---|
| 0 | 1 | Universal | 27.6% |
| 1 | 2 | Buena Vista | 17.6% |
| 2 | 3 | Warner Bros. | 16.2% |
| 3 | 4 | 20th Century Fox | 10.1% |
| 4 | 5 | Paramount | 6.9% |
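
The three separate drops can also be written as one step. The sketch below, on made-up rows with the same column names (the data and the names `demo`, `dropped`, and `kept` are hypothetical), compares dropping a list of columns with selecting the columns to keep.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
demo = pd.DataFrame(
    [["1", "Studio A", "60%", "$10.0", "5", "4"],
     ["2", "Studio B", "40%", "$6.5", "3", "2"]],
    columns=["Rank", "Distributor", "Market Share",
             "Total Gross", "Movies Tracked", "2015 Movies"],
)

# Option 1: drop several columns in one call.
dropped = demo.drop(["Total Gross", "Movies Tracked", "2015 Movies"], axis=1)

# Option 2: select the columns to keep.
kept = demo[["Rank", "Distributor", "Market Share"]]

# Both approaches give the same result.
print(dropped.equals(kept))
```

Selecting the columns to keep tends to read more clearly when only a few of many columns are wanted.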


In [19]:
studio_market_share['Market Share'] = studio_market_share['Market Share'].apply(lambda x: float(x.strip('%')) / 100.0)

In [20]:
studio_market_share

|   | Rank | Distributor | Market Share |
|---|---|---|---|
| 0 | 1 | Universal | 0.276 |
| 1 | 2 | Buena Vista | 0.176 |
| 2 | 3 | Warner Bros. | 0.162 |
| 3 | 4 | 20th Century Fox | 0.101 |
| 4 | 5 | Paramount | 0.069 |
| 5 | 6 | Sony / Columbia | 0.058 |
| 6 | 7 | Lionsgate | 0.036 |
| 7 | 8 | Weinstein Company | 0.033 |
| 8 | 9 | Focus Features | 0.013 |
| 9 | 10 | Fox Searchlight | 0.012 |
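
The percent-string conversion can be checked in isolation. This is a small sketch on made-up sample values (the `shares` Series is hypothetical) that runs the `apply` approach next to a vectorized alternative using the `.str` accessor.

```python
import pandas as pd

shares = pd.Series(["27.6%", "17.6%", "0.9%"])  # hypothetical sample values

# Element-wise with apply, mirroring the conversion above.
with_apply = shares.apply(lambda x: float(x.strip("%")) / 100.0)

# Vectorized with the .str accessor: strip the sign, convert, then scale.
vectorized = shares.str.rstrip("%").astype(float) / 100.0

print(with_apply.tolist())
print(vectorized.tolist())
```

Both produce fractions between 0 and 1; the vectorized form avoids a Python-level function call per row.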


In [21]:
studio_ms = studio_market_share[(studio_market_share['Market Share'] > 0.05)]
studio_ms

|   | Rank | Distributor | Market Share |
|---|---|---|---|
| 0 | 1 | Universal | 0.276 |
| 1 | 2 | Buena Vista | 0.176 |
| 2 | 3 | Warner Bros. | 0.162 |
| 3 | 4 | 20th Century Fox | 0.101 |
| 4 | 5 | Paramount | 0.069 |
| 5 | 6 | Sony / Columbia | 0.058 |
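
Filtering like this works by building a boolean mask and indexing with it. The sketch below uses made-up data (the frame `demo` and the `Is Major` column are hypothetical) and also shows `.copy()`, which is one common way to avoid pandas' `SettingWithCopyWarning` when a column is later added to the filtered result.

```python
import pandas as pd

# Hypothetical data for illustration.
demo = pd.DataFrame({
    "Distributor": ["Studio A", "Studio B", "Studio C"],
    "Market Share": [0.60, 0.36, 0.04],
})

# The comparison yields a boolean Series, one True/False per row.
mask = demo["Market Share"] > 0.05
print(mask.tolist())

# Indexing with the mask keeps only the True rows; .copy() makes the
# result an independent DataFrame, so later column assignments do not
# trigger pandas' SettingWithCopyWarning.
big = demo[mask].copy()
big["Is Major"] = True

print(big)
print("Is Major" in demo.columns)  # the original frame is unchanged
```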


In [22]:
# Box Office Mojo abbreviates distributor names on its movie lists.
# This first version takes a whole row of the DataFrame.

def studio(row):
    if row['Distributor'] == 'Universal':
        return 'Uni.'
    if row['Distributor'] == 'Buena Vista':
        return 'BV'
    if row['Distributor'] == 'Warner Bros.':
        return 'WB'
    if row['Distributor'] == '20th Century Fox':
        return 'Fox'
    if row['Distributor'] == 'Paramount':
        return 'Par.'
    if row['Distributor'] == 'Sony / Columbia':
        return 'Sony'

In [23]:
# Box Office Mojo abbreviates distributor names on its movie lists.

def studio(word):
    if word == 'Universal' :
        return 'Uni.'
    elif word == 'Buena Vista' :
        return 'BV'
    elif word == 'Warner Bros.' :
        return 'WB'
    elif word == '20th Century Fox' :
        return 'Fox'
    elif word == 'Paramount' :
        return 'Par.'
    elif word == 'Sony / Columbia' :
        return 'Sony'
    else:
        return ''
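
A chain of `if`/`elif` comparisons against fixed strings can also be expressed as a dictionary lookup. This is an alternative sketch, not a replacement for the function above; the names `STUDIO_ABBREVIATIONS` and `studio_from_dict` are made up here, while the abbreviations themselves are copied from `studio()`.

```python
# Abbreviations copied from the studio() function above.
STUDIO_ABBREVIATIONS = {
    'Universal': 'Uni.',
    'Buena Vista': 'BV',
    'Warner Bros.': 'WB',
    '20th Century Fox': 'Fox',
    'Paramount': 'Par.',
    'Sony / Columbia': 'Sony',
}


def studio_from_dict(word):
    """Return the abbreviation for a distributor name, or '' if unknown."""
    return STUDIO_ABBREVIATIONS.get(word, '')


print(studio_from_dict('Buena Vista'))
print(repr(studio_from_dict('Lionsgate')))  # unknown names give an empty string
```

Keeping the mapping in a dictionary makes it easy to add a studio without touching the function body.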

In [24]:
studio_ms

|   | Rank | Distributor | Market Share |
|---|---|---|---|
| 0 | 1 | Universal | 0.276 |
| 1 | 2 | Buena Vista | 0.176 |
| 2 | 3 | Warner Bros. | 0.162 |
| 3 | 4 | 20th Century Fox | 0.101 |
| 4 | 5 | Paramount | 0.069 |
| 5 | 6 | Sony / Columbia | 0.058 |


In [25]:
# studio_ms was created by filtering, so work on an explicit copy;
# this keeps pandas from raising SettingWithCopyWarning on assignment.
studio_ms = studio_ms.copy()
studio_ms['Distributor Abbrv'] = studio_ms['Distributor'].apply(studio)


In [26]:
studio_ms

|   | Rank | Distributor | Market Share | Distributor Abbrv |
|---|---|---|---|---|
| 0 | 1 | Universal | 0.276 | Uni. |
| 1 | 2 | Buena Vista | 0.176 | BV |
| 2 | 3 | Warner Bros. | 0.162 | WB |
| 3 | 4 | 20th Century Fox | 0.101 | Fox |
| 4 | 5 | Paramount | 0.069 | Par. |
| 5 | 6 | Sony / Columbia | 0.058 | Sony |


In [39]:
# Reassign instead of using inplace=True; dropping in place on a
# filtered DataFrame triggers pandas' SettingWithCopyWarning.
studio_ms = studio_ms.drop('Rank', axis=1)


In [40]:
import pickle

# Pickle is a binary format, so open the file in binary mode ('wb')
with open('studios.pkl', 'wb') as picklefile:
    pickle.dump(studio_ms, picklefile)
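
To confirm that a pickled object can be read back, a round trip is a quick check. The sketch below needs only the standard library: a plain dictionary stands in for the DataFrame, and the file name `example.pkl` and temporary directory are hypothetical.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the DataFrame so the sketch needs only the
# standard library; the file name is hypothetical.
record = {'Distributor': 'Universal', 'Market Share': 0.276}

path = os.path.join(tempfile.mkdtemp(), 'example.pkl')

# Pickle is a binary format, so write with 'wb' ...
with open(path, 'wb') as picklefile:
    pickle.dump(record, picklefile)

# ... and read back with 'rb'.
with open(path, 'rb') as picklefile:
    restored = pickle.load(picklefile)

print(restored == record)
```

For DataFrames specifically, pandas also provides `DataFrame.to_pickle()` and `pd.read_pickle()`, which handle opening the file for you.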