# Web Scraping: BeautifulSoup
_Collecting data from the internet and parsing it into meaningful (often tabular) form._

### Docs

- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Installation

If you are not using the Metis kernel, install the following libraries:

With conda:
- `conda install beautifulsoup4 requests lxml html5lib`

With pip:
- `pip install beautifulsoup4 requests lxml html5lib`

If you have installed everything correctly, you should now be able to import these libraries. (You may need to shut down and restart this notebook for BeautifulSoup to recognize the `lxml` and `html5lib` parsers.)

In [1]:
from bs4 import BeautifulSoup
import requests

## What is BeautifulSoup?

- Python library
- **HTML parser**:  Interprets structure of HTML file 
- Does not actually get pages from the web.  Use `requests` library for that.

<br>
<img src="images/web_scraping_pipeline.png" alt="Web Scraping Pipeline" style="width: 650px;"/>

## Intro to HTML

_Basic language used to create a webpage._

- Tells browser what, where, and how to display text, images, and other media
- Structured, hierarchical nature 
- Composed of "elements" with properties

Example HTML element
```html
<tag-name attr1="value of attr1" attr2="value of attr2" .... attrN="value of attrN">
    Inner text of the tag
</tag-name>
```

### Tags,  `<tag-name>`

- Elements are labeled with tags
- Tells us what type of "thing" to render

    
| Tag | Use |
|---          | ---|
|`h1`, `h2`, ..., `h6`| headers|
|`p`| paragraphs |
|`a`| anchors (e.g. links) |
|`div`| divisions (or sections) of a page |
|`img`| images |
|`li` | list items |


### Attributes, `attr1`

- Special properties we want this tag to have
- Typically appear as `attribute name = "attribute value"` pair

| Attribute | Use | Notes |
|---   | ---       | ---|
|`href` | hyperlink reference | Clicking this element sends the user to the URL given as the value |
|`class`| style class | Many elements may have the same class |
|`id`| unique identifier | Only one element per id! |
|`style`| extra element styling | Inline styling is discouraged; use CSS instead |

### Inner HTML Text

- Text that appears between tags
- Often the information we want to extract during web scraping
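
To make the three pieces concrete (tag, attributes, inner text), here is a small preview of how Beautiful Soup exposes each one. It is only a sketch: the `<a>` element, its URL, and the variable names are invented for illustration, and it uses Python's built-in `html.parser` so no extra parser is required.

```python
from bs4 import BeautifulSoup

# A hypothetical element, invented for this illustration
snippet = '<a href="https://www.example.com" class="external" id="demo_link">An example link</a>'

# html.parser ships with Python, so no extra parser install is needed here
element = BeautifulSoup(snippet, "html.parser").find("a")

tag_name = element.name      # the tag name
attributes = element.attrs   # dictionary mapping attribute names to values
inner_text = element.text    # the text between the opening and closing tags

print(tag_name)
print(attributes)
print(inner_text)
```

Note that `class` comes back as a list, because an element may carry several classes at once.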

### HTML Structure

A full HTML document has a structure similar to this:

```html
<html> 
  <head> </head>
  <body>
     <h1>This is a header</h1>
     <p style="color:red;" id="learning_paragraph">You are learning HTML</p>
     <a href="https://www.google.com">A link to Google</a>
  </body>
</html>
```

**QUESTIONS**
> How many elements do we have within the HTML body?  What are their tags?

> What is the inner HTML of the header element?

> What attribute(s) does the paragraph have?  And the attribute value(s)?

Saving this code as a .html file and opening it in a browser should yield:

<br>
<img src="images/example_html.png" alt="Rendering of Example HTML" style="width: 300px;" align="left"/>

## Learn to Scrape with Dummy HTML

Let's begin learning how to scrape by working with some dummy HTML, written below as a string.

In [3]:
my_html = """
<html>

<head>
</head>

<body>
    <div style="border: 1px solid">
        There isn't much in this file, except a list of to-do items. 
        <ul>
          <li>Make coffee</li>
          <li>Sweep the floor</li>
          <li>Go to the store</li>
          <li>Write BeautifulSoup lecture</li>
        </ul>
    </div>
</body>

</html>
"""

Let's take a look at this simple webpage.

In [4]:
from IPython.display import display, HTML
display(HTML(my_html))     # make sure Jupyter knows to display it as HTML

If we want to grab the four items on our to-do list and analyze them, we can use Beautiful Soup!

In [4]:
soup = BeautifulSoup(my_html, "html5lib")

Simply looking at `soup` isn't very useful -- it's just our HTML repeated back to us.

In [5]:
soup

<html><head>
</head>

<body>
    <div style="border: 1px solid">
        There isn't much in this file, except a list of to-do items. 
        <ul>
          <li>Make coffee</li>
          <li>Sweep the floor</li>
          <li>Go to the store</li>
          <li>Write BeautifulSoup lecture</li>
        </ul>
    </div>



</body></html>

### `.find()`

But Beautiful Soup also knows how to navigate this HTML.  We can use the `find` command to get to a specific element.

In [6]:
soup.find('li')  #Grabs the first element tagged as li

<li>Make coffee</li>

In [7]:
type(soup.find('li'))

bs4.element.Tag

`find` returns a tagged element, but we can go further and just select this element's inner HTML text.

In [8]:
soup.find('li').text

'Make coffee'

In [9]:
type(soup.find('li').text)

str

### `.find_all()`

Instead of selecting just one of our list items, we can get all of them by using `find_all`.  

This method finds every element in the document that matches our criteria and returns the matches in a list.

In [16]:
list(map(lambda li:li.text, soup.find_all('li')))

['Make coffee',
 'Sweep the floor',
 'Go to the store',
 'Write BeautifulSoup lecture']

To analyze our to-do list, we probably just want the text from each tagged element.  How can we do that?

One approach is to loop through the list and apply `.text` to each element:

In [17]:
todos=[]

for element in soup.find_all('li'):
    todos.append(element.text)
    
print(todos)

['Make coffee', 'Sweep the floor', 'Go to the store', 'Write BeautifulSoup lecture']


Or we could use a list comprehension:

In [18]:
todos=[element.text for element in soup.find_all('li')]

todos

['Make coffee',
 'Sweep the floor',
 'Go to the store',
 'Write BeautifulSoup lecture']

Now we have a clean list of strings, ready for analysis!
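
One difference between the two methods is worth seeing in code: when nothing matches, `find` returns `None` while `find_all` returns an empty list. The sketch below uses a shortened, made-up version of the to-do list and the built-in `html.parser`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

todo_html = "<ul><li>Make coffee</li><li>Sweep the floor</li></ul>"
todo_soup = BeautifulSoup(todo_html, "html.parser")

missing_one = todo_soup.find("table")      # no <table> in this HTML, so None
missing_all = todo_soup.find_all("table")  # no matches, so an empty list

# Guard against None before asking for .text
first_item = todo_soup.find("li")
first_text = first_item.text if first_item is not None else None

print(missing_one, len(missing_all), first_text)
```

Calling `.text` on the result of a failed `find` raises an `AttributeError`, so a check like the one above is a good habit when scraping pages you do not control.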

## Scrape Select Items on a Test Webpage

Now on to a more complicated example: Wikipedia's [List of role-playing games](https://en.wikipedia.org/wiki/List_of_role-playing_games) page.  Let's try to grab all of the links in its table of games, like the ones for 13th Age and Pelgrane Press.

First get the HTML and then parse it with BeautifulSoup.

In [24]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_role-playing_games')
test_html = response.text
soup = BeautifulSoup(test_html, 'html5lib') # lxml is a faster parser but more brittle

Links show up as `a` tags.  Let's just try to grab all of them.

In [25]:
soup.find_all('a')

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="external text" href="https://en.wikipedia.org/w/index.php?title=List_of_role-playing_games&amp;action=edit">improve it</a>,
 <a href="/wiki/Talk:List_of_role-playing_games" title="Talk:List of role-playing games">talk page</a>,
 <a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove these template messages</a>,
 <a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" decoding="async" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png

Uh oh!  Looks like there are navigation and maintenance links, too.  We want only the ones in the table of games, so we'll need a better strategy.

Digging into the source code, it turns out that the games live within `table` elements labeled with the class `wikitable`.  Let's try to get those.

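### `class_` and `id`
The `find` and `find_all` methods take optional attribute arguments so you can filter down to elements with specific attributes like classes and ids.  The class argument is spelled `class_` because `class` is a reserved word in Python; `id` needs no underscore.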
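
As a rough sketch of these filters, the snippet below runs them against a small made-up page (the `sidebar` id, the `article` class, and the URLs are all hypothetical). It also shows the equivalent `attrs` dictionary form.

```python
from bs4 import BeautifulSoup

demo_html = """
<div id="sidebar"><a href="/about">About</a></div>
<div class="article featured"><a href="/news/1">First story</a></div>
<div class="article"><a href="/news/2">Second story</a></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# class_ matches any element whose class list includes "article"
articles = demo_soup.find_all("div", class_="article")

# id is passed without an underscore
sidebar = demo_soup.find(id="sidebar")

# the same class filter written as an attrs dictionary
same_articles = demo_soup.find_all("div", attrs={"class": "article"})

print([div.a.text for div in articles])
```

The first `div` carries two classes (`article` and `featured`) and still matches, since the filter checks each class separately.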

In [29]:
soup.find_all('table', class_='wikitable')

[<table class="wikitable" width="100%">
 <tbody><tr>
 <th width="10%">Game
 </th>
 <th width="10%">Publisher
 </th>
 <th width="10%">System
 </th>
 <th width="5%">Dates
 </th>
 <th width="15">Setting
 </th>
 <th width="50%">Notes
 </th></tr>
 <tr>
 <td><i><a href="/wiki/13th_Age" title="13th Age">13th Age</a></i>
 </td>
 <td><a href="/wiki/Pelgrane_Press" title="Pelgrane Press">Pelgrane Press</a>
 </td>
 <td><a href="/wiki/D20_System" title="D20 System">d20</a>
 </td>
 <td>2013
 </td>
 <td>Fantasy
 </td>
 <td>By Jonathan Tweet and Rob Heinsoo
 </td></tr>
 
 <tr>
 <td><i><a href="/wiki/2300_AD" title="2300 AD">2300 AD</a></i>
 </td>
 <td><a href="/wiki/Game_Designers%27_Workshop" title="Game Designers' Workshop">Game Designers' Workshop</a>
 </td>
 <td>
 </td>
 <td>1989 -
 </td>
 <td>Future of the <a class="mw-redirect" href="/wiki/Twilight_2000" title="Twilight 2000">Twilight 2000</a> universe
 </td>
 <td>Originally called Traveller 2300AD
 </td></tr>
 <tr>
 <td><i><a href="/wiki/3D%26

There is our game data!  

Each matching `table` element is also a Beautiful Soup `Tag`, so we can query it to drill down further to just the links.

In [33]:
for table in soup.find_all('table', class_='wikitable'):
    for link in table.find_all('a'):
        print(link)

<a href="/wiki/13th_Age" title="13th Age">13th Age</a>
<a href="/wiki/Pelgrane_Press" title="Pelgrane Press">Pelgrane Press</a>
<a href="/wiki/D20_System" title="D20 System">d20</a>
<a href="/wiki/2300_AD" title="2300 AD">2300 AD</a>
<a href="/wiki/Game_Designers%27_Workshop" title="Game Designers' Workshop">Game Designers' Workshop</a>
<a class="mw-redirect" href="/wiki/Twilight_2000" title="Twilight 2000">Twilight 2000</a>
<a href="/wiki/3D%26T" title="3D&amp;T">3D&amp;T</a>
<a href="/wiki/7th_Sea_(role-playing_game)" title="7th Sea (role-playing game)">7th Sea</a>
<a href="/wiki/Alderac_Entertainment_Group" title="Alderac Entertainment Group">Alderac Entertainment Group</a>
<a href="/wiki/John_Wick_Presents" title="John Wick Presents">John Wick Presents</a>
<a href="/wiki/John_Wick_(game_designer)" title="John Wick (game designer)">John Wick (game designer)</a>
<a href="/wiki/9th_Generation" title="9th Generation">9th Generation</a>
<a class="mw-redirect" href="/wiki/Aberrant_(role-

Excellent!  What if we want to print out the link text and the url it points to?


### `.get()`
The `get` method gives you access to any attribute of the element.

In [35]:
for table in soup.find_all('table', class_='wikitable'):
    for link in table.find_all('a'):
        print(f'{link.text:20s} ---> https://en.wikipedia.org{link.get("href")}')

13th Age             ---> https://en.wikipedia.org/wiki/13th_Age
Pelgrane Press       ---> https://en.wikipedia.org/wiki/Pelgrane_Press
d20                  ---> https://en.wikipedia.org/wiki/D20_System
2300 AD              ---> https://en.wikipedia.org/wiki/2300_AD
Game Designers' Workshop ---> https://en.wikipedia.org/wiki/Game_Designers%27_Workshop
Twilight 2000        ---> https://en.wikipedia.org/wiki/Twilight_2000
3D&T                 ---> https://en.wikipedia.org/wiki/3D%26T
7th Sea              ---> https://en.wikipedia.org/wiki/7th_Sea_(role-playing_game)
Alderac Entertainment Group ---> https://en.wikipedia.org/wiki/Alderac_Entertainment_Group
John Wick Presents   ---> https://en.wikipedia.org/wiki/John_Wick_Presents
John Wick (game designer) ---> https://en.wikipedia.org/wiki/John_Wick_(game_designer)
9th Generation       ---> https://en.wikipedia.org/wiki/9th_Generation
Aberrant             ---> https://en.wikipedia.org/wiki/Aberrant_(role-playing_game)
White Wolf Publishin
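
Two details of the loop above can be made more robust. `get` returns `None` when an attribute is missing (whereas `link['href']` raises a `KeyError`), and `urllib.parse.urljoin` from the standard library resolves relative links without manual string concatenation. The following is a minimal sketch on a made-up three-link snippet; the variable names are illustrative.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/List_of_role-playing_games"
links_html = '<a id="top"></a><a href="/wiki/13th_Age">13th Age</a><a href="#mw-head">Jump</a>'
links_soup = BeautifulSoup(links_html, "html.parser")

absolute_urls = []
for link in links_soup.find_all("a"):
    href = link.get("href")  # None when the tag has no href attribute
    if href is None:
        continue
    # urljoin handles site-relative paths and in-page fragments alike
    absolute_urls.append(urljoin(base_url, href))

print(absolute_urls)
```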

## Scrape the Web

So far we've used BeautifulSoup to parse a hand-written HTML string and a Wikipedia page.  Now let's scrape Box Office Mojo.

First let's take a look at some source code.

- Navigate to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser, preferably Chrome
- Right click and select "Inspect"
- Alternatively, you can "View Page Source"

To retrieve the HTML for this webpage, we will use `requests`.

### `requests`

The `requests` library allows us to grab information from the web.  There are two common types of requests:
- `get` -- simply requests information, akin to putting a url in your browser
- `post` -- sends information to the website, for example, writing an email

We will be using `get` to retrieve a page's HTML.

In [36]:
url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm' 

response = requests.get(url)

The response we got back is an object that gives us access to:
- `response.text` -- the returned HTML (if any)
- `response.json()` -- the returned JSON parsed into Python objects (if any), typical for APIs
- `response.status_code` -- a [code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) that tells you whether your request succeeded or an error occurred. 2XX indicates success, while 404 means not found.
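
A small wrapper can combine these pieces so that failed requests are not silently parsed as if they were real pages. This is a sketch under stated assumptions: `fetch_html` and its `get` parameter are hypothetical names introduced here (the parameter exists only so the function can be tried without a network connection), and the 10 second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the page HTML, raising requests.HTTPError on a 4XX/5XX status."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # does nothing on success, raises on 4XX/5XX
    return response.text
```

Usage would look like `page = fetch_html(url)`, after which `page` can be handed to `BeautifulSoup` as before.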

In [37]:
response.status_code  #200 = success!

200

In [38]:
response.text[:1000]  #First 1000 characters of the HTML

'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html lang="en">\n<head>\n<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">\n<title>The Big Lebowski (1998) - Box Office Mojo</title>\n\n<style type="text/css">\ntable.chart-wide { width: 100%; }\n</style>\n<META name="keywords" content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo">\n<META name="description" content="The Big Lebowski summary of box office results, charts and release information and related link

In [39]:
page = response.text

### `BeautifulSoup` Basics

Now that we have the HTML, let's learn its structure by parsing with BeautifulSoup.

In [46]:
soup = BeautifulSoup(page, "lxml")

In [47]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
<title>The Big Lebowski (1998) - Box Office Mojo</title>
<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
<meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
<link cha

The `prettify` method turns the soup into a nicely formatted Unicode string with one tag on each line for readability.

In [42]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
  <title>
   The Big Lebowski (1998) - Box Office Mojo
  </title>
  <style type="text/css">
   table.chart-wide { width: 100%; }
  </style>
  <meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
  <meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="d

**QUESTION**

> Select the first link on the page.  Can you get the text and url associated with this link?

In [48]:
soup.find('a')

<a href="/daily/chart/">Daily Box Office (Sun.)</a>

In [49]:
soup.find('a').text

'Daily Box Office (Sun.)'

In [51]:
soup.find('a').get('href')

'/daily/chart/'

Here's an equivalent way to find the first link on this page:

In [43]:
soup.a

<a href="/daily/chart/">Daily Box Office (Sun.)</a>

In [44]:
print(soup.a.prettify())

<a href="/daily/chart/">
 Daily Box Office (Sun.)
</a>



You can also grab the link from this anchor tag:

In [45]:
soup.a['href']

'/daily/chart/'

Remember `find` gets only one match, but `find_all` retrieves all matches in a list.

In [52]:
for link in soup.find_all('a')[:5]:
    print(link)

<a href="/daily/chart/">Daily Box Office (Sun.)</a>
<a href="/weekend/chart/">Weekend Box Office (Sep. 27–29)</a>
<a href="/movies/?id=everest2019.htm">#1 Movie: 'Abominable'</a>
<a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>
<a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>


And you can match only those with a specific `id` or `class` if you'd like.  Here are all the elements labeled with the "mp_box_content" class.

In [53]:
for element in soup.find_all(class_='mp_box_content'):
    print(element, '\n')

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$18,034,458</b></td>
<td align="right" width="25%">   <b>38.6%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $28,690,764</td>
<td align="right" width="25%">   61.4%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$46,725,222</b></td>
<td width="25%"> </td>
</tr>
</table>
</div> 

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td><td> $5,533,844</td></tr>
<tr>
<td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td></tr>
<tr>
<td align="right">% of Total Gross:</td><td> 31.7%</td></tr>
<tr><td align="right" colspan="2"><font fac

It's important to remember that `find` returns a Beautiful Soup element and `find_all` returns a list of them. You can keep searching within these elements, chaining commands together.

The first element with the "mp_box_content" class is a `div` containing a table.  Let's extract the domestic gross.

<br>
<img src="images/biglebow_table.png" alt="Big Lebowski Table" style="width: 250px;"/>

In [54]:
soup.find(class_='mp_box_content')

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$18,034,458</b></td>
<td align="right" width="25%">   <b>38.6%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $28,690,764</td>
<td align="right" width="25%">   61.4%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$46,725,222</b></td>
<td width="25%"> </td>
</tr>
</table>
</div>

In [55]:
soup.find(class_='mp_box_content').find_all('td')

[<td width="40%"><b>Domestic:</b></td>,
 <td align="right" width="35%"> <b>$18,034,458</b></td>,
 <td align="right" width="25%">   <b>38.6%</b></td>,
 <td width="40%">+ Foreign:</td>,
 <td align="right" width="35%"> $28,690,764</td>,
 <td align="right" width="25%">   61.4%</td>,
 <td colspan="3" width="100%"><hr/></td>,
 <td width="40%">= <b>Worldwide:</b></td>,
 <td align="right" width="35%"> <b>$46,725,222</b></td>,
 <td width="25%"> </td>]

Text needs to be extracted from one element at a time.  To get the domestic gross:

In [56]:
soup.find(class_='mp_box_content').find_all('td')[1].text

'\xa0$18,034,458'

Be careful about non-printing characters like the non-breaking space (`\xa0`) above!  Calling `.strip()` on the string removes it.
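
As a quick illustration, the value above can be cleaned with plain string methods. The variable names are illustrative; the raw string is copied from the output above.

```python
raw_cell_text = '\xa0$18,034,458'

# str.strip() removes leading and trailing whitespace, including \xa0
cleaned = raw_cell_text.strip()

# drop the currency symbol and thousands separators before converting
amount = int(cleaned.replace('$', '').replace(',', ''))

print(repr(cleaned), amount)
```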

You can also search by `id`; remember that an id should be unique to just one element.

In [57]:
print(soup.find(id='hp_footer'))

<div id="hp_footer">
<div style="padding-bottom: 20px;">
<div style="margin: 0px 121px; vertical-align: top;">
<div id="footer_links">
<ul class="footer_link_list">
<li><strong>Latest Updates</strong></li>
<li><a href="/news/?ref=ft">Movie News</a>
</li><li><a href="/daily/chart/?ref=ft">Daily Chart</a></li>
<li><a href="/weekend/chart/?ref=ft">Weekend Chart</a></li>
<li><a href="/alltime/?ref=ft">All Time Charts</a></li>
<li><a href="/intl/?ref=ft">International Charts</a></li>
</ul>
<!--
					<ul class="footer_link_list">
						<li><strong>Popular Movies</strong></li>
											</ul>
					-->
<ul class="footer_link_list">
<li><strong>Indices</strong></li>
<li><a href="/people/?ref=ft">People</a></li>
<li><a href="/genres/?ref=ft">Genres</a></li>
<li><a href="/franchises/?ref=ft">Franchises</a></li>
<li><a href="/showdowns/?ref=ft">Showdowns</a></li>
</ul>
<ul class="footer_link_list">
<li><strong>Other</strong></li>
<li><a href="/about/?ref=ft">About This Site</a></li>
<li><a href="

### Web Scraping Pipeline

Now that we have the basics, let's practice web scraping.  **The main goal of web scraping is to extract data by taking advantage of a site's consistent format.**  That is, the code you write for one page on a website can hopefully be used on multiple pages to gather more information automatically.

Let's create code to get the following information for the movies on Box Office Mojo:
- Movie title
- Domestic gross
- Runtime
- MPAA rating
- Release date

#### Movie Title

In [58]:
soup.find('title')

<title>The Big Lebowski (1998) - Box Office Mojo</title>

In [59]:
title_string = soup.find('title').text

title_string

'The Big Lebowski (1998) - Box Office Mojo'

In [60]:
title_string.split('(')

['The Big Lebowski ', '1998) - Box Office Mojo']

In [61]:
title = title_string.split('(')[0].strip()

title

'The Big Lebowski'
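
Splitting on `(` works here but would cut short a title that itself contains parentheses. A regular expression anchored on the `(year) - Box Office Mojo` suffix is one possible alternative. This is only a sketch: `parse_title` is a hypothetical helper, and the pattern assumes every page title follows the format seen above.

```python
import re

# title, then a four-digit year in parentheses, then the site suffix
TITLE_PATTERN = re.compile(
    r'^(?P<title>.+?)\s+\((?P<year>\d{4})\)\s+-\s+Box Office Mojo$'
)

def parse_title(title_string):
    """Return (title, year), or None if the string has an unexpected format."""
    match = TITLE_PATTERN.match(title_string.strip())
    if match is None:
        return None
    return match.group('title'), int(match.group('year'))

print(parse_title('The Big Lebowski (1998) - Box Office Mojo'))
```

As a bonus, the release year comes out as an integer alongside the title.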

<img src="images/biglebow_info.png" alt="Big Lebowski Information" style="width: 600px;"/>

#### Domestic Gross: `.findNextSibling()`

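Sometimes you can find the information you are looking for by matching on text.  But note that a plain string must be an exact match, including any trailing spaces!  (Beautiful Soup 4.4.0 and later call this argument `string`; `text` is the older name.)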

In [62]:
soup.find(text='Domestic Total Gross')  #does not match

In [63]:
soup.find(text='Domestic Total Gross: ')  

'Domestic Total Gross: '

Alternatively, we could use [regular expressions](https://docs.python.org/3/library/re.html).

In [64]:
import re
domestic_total_regex = re.compile('Domestic Total')
soup.find(text=domestic_total_regex)

'Domestic Total Gross: '

In [65]:
dtg_string = soup.find(text=re.compile('Domestic Total'))
print(dtg_string)

Domestic Total Gross: 


In [66]:
type(dtg_string)

bs4.element.NavigableString

The string we found is still a Beautiful Soup element. This means we can use it to navigate to its next sibling tag, which holds the actual gross in dollars.

In [67]:
dtg_string.findNextSibling()

<b>$17,451,873</b>

The `.findNextSibling()` method can be incredibly useful when the information you want to find doesn't have an obvious tag, class, id, etc.  (It is the older camelCase spelling of `.find_next_sibling()`.)
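
The label-then-sibling pattern can be tried in isolation on a tiny snippet. The HTML below is a made-up stand-in for the Box Office Mojo markup, and the sketch uses the newer spellings `string=` and `find_next_sibling`, passing the tag name `'b'` to make the intent explicit.

```python
import re
from bs4 import BeautifulSoup

snippet = '<td>Domestic Total Gross: <b>$17,451,873</b></td>'
cell_soup = BeautifulSoup(snippet, "html.parser")

# 1. locate the label text with a regular expression
label = cell_soup.find(string=re.compile('Domestic Total'))

# 2. step sideways to the <b> tag that follows the label
value_tag = label.find_next_sibling('b')
value_text = value_tag.text

print(repr(str(label)), value_text)
```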

Let's clean this value up into usable data.

In [68]:
dtg = dtg_string.findNextSibling().text
dtg = dtg.replace('$','').replace(',','')
domestic_total_gross = int(dtg)
print(domestic_total_gross)

17451873


#### Runtime, MPAA Rating, Release Date

_**STEP 1:** Create function to grab values_ 

The text matching method can also help us get runtime, rating, and release date, so let's make a reusable function.

In [69]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from Box Office Mojo HTML
    
    Takes a string attribute of a movie on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_sibling = obj.findNextSibling()
    
    if next_sibling:
        return next_sibling.text 
    else:
        return None

In [70]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

$17,451,873


In [71]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)

1 hrs. 57 min.


In [72]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

R


In [73]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

March 6, 1998


_**STEP 2:** Create helper functions to parse strings into appropriate data types_

The returned values all need a bit of formatting before we can work with this data.  Here are a few helper functions.

In [74]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

def runtime_to_minutes(runtimestring):
    # expects a string like '1 hrs. 57 min.'
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):
        # unexpected format, e.g. 'N/A'
        return None

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
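
If `dateutil` is not installed, or if you would rather fail loudly on unexpected formats, the standard library can parse the same string. This is a sketch of an alternative, not the notebook's method: `to_date_strict` is a hypothetical name, and the format string assumes dates always look like `March 6, 1998` with English month names.

```python
from datetime import datetime

def to_date_strict(datestring):
    """Parse dates like 'March 6, 1998'; return None for any other format."""
    try:
        return datetime.strptime(datestring.strip(), "%B %d, %Y")
    except ValueError:
        return None

print(to_date_strict("March 6, 1998"))
```

The trade-off is flexibility: `dateutil.parser.parse` accepts many layouts, while `strptime` accepts exactly one.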

_**STEP 3:** Apply these conversions_

Let's get these values again and format them all in one swoop. (Note: Rating is already correct as a string.)

In [75]:
raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

#### Put Results in Dictionary

Now that we have results for all five quantities, we can store them in a dictionary.

In [76]:
headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

movie_data.append(movie_dict)
movie_data

[{'movie title': 'The Big Lebowski',
  'domestic total gross': 17451873,
  'runtime (mins)': 117,
  'rating': 'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0)}]

**QUESTION**

> Why might we want to store these data in a dictionary?  Why did we put the dictionary in a list?

### Scraping Tables

Let's take a look at the [foreign language page](http://www.boxofficemojo.com/genres/chart/?id=foreign.htm) of Box Office Mojo.  How could we pull all the data from this main page?

First request the HTML and parse it with Beautiful Soup.

In [None]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")

Now find all the tables.

In [None]:
tables = soup.find_all("table")
tables

In [None]:
len(tables)

There are 5 tables but the main one is `tables[3]`.  Let's pull all the rows from that table.

In [None]:
rows = [row for row in tables[3].find_all('tr')]  # tr tag is for rows

Each row contains the information we want but requires more parsing.

In [None]:
rows[1]

Remember: you can chain methods together to look for information!

In [None]:
rows[1].find_all('td')[1].find('a')['href']

Now grab data for the first 5 movies with a loop.

In [None]:
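rows = rows[1:6]  # skip the header row, keep the first 5 movies
movies = {}

for row in rows:
    items = row.find_all('td')
    # the link to the movie's page is a unique key for each row
    movie_url = items[1].find('a')['href']
    movies[movie_url] = [i.text for i in items[1:]]
    
movies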
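
The same row-by-row idea can be sketched on a small self-contained table, this time using the header row to label each value. The table contents and variable names below are invented for illustration and do not come from Box Office Mojo.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Gross</th></tr>
  <tr><td>1</td><td><a href="/movies/?id=a.htm">Movie A</a></td><td>$100</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=b.htm">Movie B</a></td><td>$50</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")
all_rows = table_soup.find("table").find_all("tr")

# header cells use <th>; data cells use <td>
column_names = [cell.text.strip() for cell in all_rows[0].find_all("th")]

records = []
for row in all_rows[1:]:
    values = [cell.text.strip() for cell in row.find_all("td")]
    records.append(dict(zip(column_names, values)))

print(records)
```

Each row becomes a dictionary keyed by column name, which matches the list-of-dictionaries layout used for `movie_data` earlier.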

## Recap

- Beautiful Soup is a powerful HTML parser
- You can locate one element with `.find()` or all matching elements with `.find_all()`
- To select specific elements, you can filter by attributes like `class` or `id` 
- You can also find items using text matching and `.findNextSibling()`

### Limitations
Beautiful Soup has its limitations though.  For example, we can't use Beautiful Soup if a page:
- Requires us to input a password
- Reveals information we want only when we interact with it
- Is generated dynamically (with JavaScript) rather than served as static HTML

For these situations we need a different tool, like **Selenium** -- coming soon!

In [4]:
from bs4 import BeautifulSoup
import requests
import re
import json
import time
import random

# Kickstarter

In [5]:
# category_id=34 tabletop = 12,249 projects
# seems to be 12 per page hence 1020+ pages
data_dir = '../data/'
category_id = 34

for page in range(1, 5):
    url = f"https://www.kickstarter.com/discover/advanced?state=successful&category_id={category_id}&sort=end_date&seed=2616584&page={page}"
    response = requests.get(url)
    if response.status_code != 200:
        # log the failure, back off for a random interval, then skip this page
        print(response.status_code)
        print(response.headers)
        print(response.text)
        sleep_for = random.randint(1, 30)
        print(f"sleeping for {sleep_for} seconds...")
        time.sleep(sleep_for)
        continue
    test_html = response.text
    # only write if we got a good html page
    try:
        soup = BeautifulSoup(test_html, 'lxml') # lxml is a faster parser but more brittle
    except Exception as err:
        print("Parsing error:", err)
        continue # continue loop without writing file
    filename = f"{data_dir}kickstarter_cat_{category_id:04}_pg_{page:05}.html"
    with open(filename, 'w') as writer:
        writer.write(test_html)
    time.sleep(1)  # pause between requests
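
Note that the loop above skips a page after a failed request rather than trying it again. If retrying is preferred, one possible shape is sketched below. Everything here is an assumption for illustration: `fetch_with_retries` is a hypothetical helper, `fetch` stands for any callable such as `requests.get`, and the exponential wait with random jitter is just one common choice.

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=3, sleep=time.sleep):
    """Call fetch(url) until it returns a 200 response or attempts run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute, e.g. `requests.get`. Returns None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # wait longer after each failure, plus a random fraction of a
            # second so repeated requests are not evenly spaced
            sleep(2 ** attempt + random.random())
    return None
```

Inside the loop, `response = fetch_with_retries(requests.get, url)` followed by a check for `None` would replace the status-code branch.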






In [10]:
for page in range(1,2):
    with open(filename, 'w') as writer:
            writer.write(test_html)


In [15]:
project_url= "https://www.kickstarter.com/projects/cochin-industrial/cochin-industrial-district-3d-printable-world-building"
try:
    # a 0.1 second timeout is very short; raise it for real use
    response = requests.get(project_url, timeout=0.1)
    with open('../data/cochin-industrial__cochin-industrial-district-3d-printable-world-building.html', 'w') as writer:
        writer.write(response.text)
except requests.exceptions.Timeout as err:
    # covers both connection and read timeouts
    print(err)

In [4]:
soup = BeautifulSoup(test_html, 'lxml') # lxml is a faster parser but more brittle
res = soup.find_all('div', class_=re.compile('js-react-proj-card'))
print(len(res))
project_card_data = res[0].get('data-project')
json.loads(project_card_data)

12


{'id': 585124775,
 'photo': {'key': 'assets/026/261/043/59000e347fc9e3471dbde860c6622b1f_original.png',
  'full': 'https://ksr-ugc.imgix.net/assets/026/261/043/59000e347fc9e3471dbde860c6622b1f_original.png?ixlib=rb-2.1.0&crop=faces&w=560&h=315&fit=crop&v=1566836547&auto=format&frame=1&q=92&s=8e907b8c81077aed1f54a1b06a48d858',
  'ed': 'https://ksr-ugc.imgix.net/assets/026/261/043/59000e347fc9e3471dbde860c6622b1f_original.png?ixlib=rb-2.1.0&crop=faces&w=352&h=198&fit=crop&v=1566836547&auto=format&frame=1&q=92&s=88e130e684d0a8589ae11cb6d318cfec',
  'med': 'https://ksr-ugc.imgix.net/assets/026/261/043/59000e347fc9e3471dbde860c6622b1f_original.png?ixlib=rb-2.1.0&crop=faces&w=272&h=153&fit=crop&v=1566836547&auto=format&frame=1&q=92&s=98cbb0423b4227c756fadd45c320f335',
  'little': 'https://ksr-ugc.imgix.net/assets/026/261/043/59000e347fc9e3471dbde860c6622b1f_original.png?ixlib=rb-2.1.0&crop=faces&w=208&h=117&fit=crop&v=1566836547&auto=format&frame=1&q=92&s=7411aae2d9ca160769e0279d577daaae',
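
The pattern used in this cell, reading a JSON string out of a `data-` attribute and decoding it, can be tried on a small stand-in. The card below is invented for illustration (the real `data-project` payload is far larger), and the field names `name` and `pledged` are assumptions rather than a description of Kickstarter's schema.

```python
import json
from bs4 import BeautifulSoup

# a made-up card with a JSON payload stored in an HTML attribute
card_html = """<div class="js-react-proj-card" data-project='{"id": 1, "name": "Demo project", "pledged": 1234.5}'></div>"""
card_soup = BeautifulSoup(card_html, "html.parser")

card = card_soup.find("div", class_="js-react-proj-card")
raw_json = card.get("data-project")   # attribute value as a plain string
project = json.loads(raw_json)        # decoded into a Python dictionary

print(project["name"], project["pledged"])
```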
 

# Validation Framework: Simple

1. Partition data — 60% TRAIN data set, 20% VALIDATE data set, 20% TEST data set
2. Design a set of candidate models
3. CHOICE process
 - Train the set of candidate models on TRAIN data set.
 - Get SCORE of each model with a VALIDATE data set.
 - Select MODEL based on SCORE.
4. SAMPLE-SCORE process
 - Retrain selected MODEL with both TRAIN and VALIDATE data set, i.e. 80% of the ENTIRE data set.
 - Get a SAMPLE-SCORE of how our model will perform in the wild, by scoring against the TEST data set.
 - Verify that this meets the relevant business metric.
5. DEPLOY
 - Retrain the model on 100% of the data and deploy.
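
The partition in step 1 might be sketched as follows, using only the standard library. `train_validate_test_split` is a hypothetical helper, and the fixed seed is an assumption made so the split is reproducible.

```python
import random

def train_validate_test_split(rows, seed=0):
    """Shuffle rows and split them 60% / 20% / 20%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # integer arithmetic avoids floating-point rounding in the sizes
    n_train = len(shuffled) * 6 // 10
    n_validate = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test

train, validate, test = train_validate_test_split(range(100))
print(len(train), len(validate), len(test))
```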

# Validation Framework: Cross Validation (K-Fold Partitioning)

_(Use this when scores vary a lot across different validation data sets, which is typical for small data sets.)_

1. Partition data: 80% CROSS-VALIDATE data set, 20% TEST data set
2. Design a set of candidate models
3. CHOICE process
 - Generate K (let's say 5), FOLDS, i.e. partitions of CROSS-VALIDATE data set.
 - For each CANDIDATE model, get a score for a TRAIN->VALIDATE sequence such that we have used each FOLD as the VALIDATE data with all of the rest of the FOLDS combined into a TRAIN data set. e.g.:
    - _Train on all of the Ts combined. Validate against the V:_
    - **[V]**[ T ][ T ][ T ][ T ],
    - [ T ]**[V]**[ T ][ T ][ T ],
    - [ T ][ T ]**[V]**[ T ][ T ],
    - [ T ][ T ][ T ]**[V]**[ T ],
    - [ T ][ T ][ T ][ T ]**[V]**
 - Take a mean of these CROSS_VALIDATION scores for each model to generate a validation SCORE
 - Select MODEL based on SCORE.
4. SAMPLE-SCORE process
 - Retrain selected MODEL with the CROSS-VALIDATE data set, i.e. 80% of the ENTIRE data set.
 - Get a SAMPLE-SCORE of how our model will perform in the wild, by scoring against the TEST data set.
 - Verify that this meets the relevant business metric.
5. DEPLOY
 - Retrain the model on 100% of the data and deploy.
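
The fold rotation in step 3 can be sketched with plain Python. Both function names are hypothetical, `score_fold` stands for whatever train-then-score routine a candidate model uses, and the rows are assumed to be shuffled already.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validate_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    # spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validate = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validate
        start += size

def cross_validation_score(score_fold, n_rows, k=5):
    """Mean of score_fold(train, validate) over the k folds."""
    scores = [score_fold(train, validate)
              for train, validate in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)

for train_idx, validate_idx in k_fold_indices(10, k=5):
    print(validate_idx, train_idx)
```

Each row index lands in the validation set exactly once, mirroring the `[V][ T ][ T ][ T ][ T ]` diagram above.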

# NOTE: the validation SCORE is not an estimate of generalization error; that is what the TEST data set is for.