Tutorial: https://programminghistorian.org/lessons/intro-to-beautiful-soup

This notebook tests extracting only the relevant data from a single web page. Downloading several weeks' worth of pages and feeding them to a script like this one will probably need an HTTP library such as urllib3.
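As a starting point for the multi-week case, the list of pages to download can be generated before any request is made. The sketch below uses only the standard library and assumes every weekly chart follows the same URL pattern as the page used in this test; `BASE_URL` and `weekly_chart_urls` are hypothetical names introduced here for illustration.

```python
from datetime import date, timedelta

# Assumption: every weekly chart URL follows the pattern of the page used in this test.
BASE_URL = "http://www.the-numbers.com/box-office-chart/weekly/{:%Y/%m/%d}"


def weekly_chart_urls(start, weeks):
    """Return chart URLs for `weeks` consecutive weeks, beginning at `start`."""
    return [BASE_URL.format(start + timedelta(weeks=i)) for i in range(weeks)]


urls = weekly_chart_urls(date(2014, 1, 10), 3)
for url in urls:
    print(url)
```

Each URL could then be fetched with an HTTP library, ideally with a pause such as `time.sleep` between requests so the real site is not hammered.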

The notebook currently reads a local .html file instead of requesting the live website, to avoid hammering the real site with requests. The page used in this test is: http://www.the-numbers.com/box-office-chart/weekly/2014/01/10

### Current Status
The open problem is narrowing the results down to the data table alone while keeping the `a` tags that hold the film name and distributor and ignoring every other `a` tag on the page.
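One possible way around this, offered as a sketch rather than a finished solution, is to walk the table row by row and take the text of every `td`, so that the film name and distributor stay next to their figures. The markup below is a shortened copy of the chart's first row; the names `html`, `sample`, and `rows` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample copied from the chart's first row; the real page has many more rows.
html = """
<table><tbody>
<tr><th> </th><th> </th><th>Movie</th><th>Distributor</th><th>Gross</th></tr>
<tr>
<td class="data chart_up"><b>1</b></td>
<td class="data">(35)</td>
<td><b><a href="http://www.the-numbers.com/movie/Lone-Survivor#tab=box-office">Lone Survivor</a></b></td>
<td><a href="http://www.the-numbers.com/market/distributor/Universal">Universal</a></td>
<td class="data">$50,428,560</td>
</tr>
</tbody></table>
"""

sample = BeautifulSoup(html, "html.parser")

rows = []
for tr in sample.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row holds only `th` cells, so skip it
        continue
    # get_text() flattens nested `b` and `a` tags to their visible text
    rows.append([td.get_text(strip=True) for td in cells])

print(rows)
```

On the full page this loop would also pick up rows from other tables, such as the previous/next navigation, so the search would need to be limited to the chart's own `tbody`.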

In [1]:
from bs4 import BeautifulSoup
import csv

In [2]:
# Open the local html file and parse it into a BeautifulSoup object stored in `soup`.
with open("The Numbers - Weekly Box Office Chart for January 10th, 2014.html", encoding='utf8') as infile:
    soup = BeautifulSoup(infile, "html.parser")

In [3]:
print(soup)

<!DOCTYPE html>

<!-- saved from url=(0061)http://www.the-numbers.com/box-office-chart/weekly/2014/01/10 -->
<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true for "http://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
<meta content="telephone=no" name="format-detection"/> <!-- for apple mobile -->
<meta content="521546213" property="fb:admins"/>
<meta content="initial-scale=1" name="viewport"/>
<meta content="Weekly (Fri-Thu) Domestic Chart for the week starting on January 10th, 2014" name="description"/>
<meta content="NOODP" name="robots"/>
<meta content="Box Office, Chart, Domestic, Weekly, Week of January 10th, 2014" name="keywords"/>
<title>The Numbers - Weekly Box 

In [4]:
# The prettify method adds indents and makes the html slightly more human-readable.
print(soup.prettify())

<!DOCTYPE html>
<!-- saved from url=(0061)http://www.the-numbers.com/box-office-chart/weekly/2014/01/10 -->
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true for "http://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
  <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
  <meta content="telephone=no" name="format-detection"/>
  <!-- for apple mobile -->
  <meta content="521546213" property="fb:admins"/>
  <meta content="initial-scale=1" name="viewport"/>
  <meta content="Weekly (Fri-Thu) Domestic Chart for the week starting on January 10th, 2014" name="description"/>
  <meta content="NOODP" name="robots"/>
  <meta content="Box Office, Chart, Domestic, Weekly, Week of January 10th, 2014" name="keywords"/>
  <title>


In [5]:
# Find every `a` tag. This returns far more links than we want and skips all chart data other than the film names.
filmNames = soup.find_all('a')

for name in filmNames:
    print(name)

<a href="http://www.the-numbers.com/"><img alt="The Numbers - Where Data and Movies Meet" border="0" height="67" src="./The Numbers - Weekly Box Office Chart for January 10th, 2014_files/the-numbers-banner.png" width="524"/>®
<br/>    Where Data and the Movie Business Meet
</a>
<a href="http://www.facebook.com/TheNumbers" target="_blank"><img height="32" src="./The Numbers - Weekly Box Office Chart for January 10th, 2014_files/facebook.png" style="border:none;" title="Follow The Numbers on Facebook" width="32"/></a>
<a href="http://www.twitter.com/MovieNumbers" target="_blank" title="Follow The Numbers on Twitter"><img height="32" src="./The Numbers - Weekly Box Office Chart for January 10th, 2014_files/twitter.png" style="border:none;" width="32"/></a>
<a href="http://www.the-numbers.com/news">News</a>
<a href="http://www.the-numbers.com/news">Latest News</a>
<a href="http://www.the-numbers.com/movies/release-schedule">Release Schedule</a>
<a href="http://www.the-numbers.com/on-this-d

In [6]:
# finalNames = soup.div.div.a
# finalNames.decompose()

# finalNames = soup.ul.li.a
# finalNames.decompose()

# filmNames = soup.find_all('a')

# for name in filmNames:
#     print(name)

In [55]:
# Select only `td` tags with the class 'data'. This shows the chart figures but not the film names, because the film name and distributor cells carry no class.

table = soup.find_all('td', class_='data')

for rows in table:
    print(rows)

<td class="data chart_up"><b>1</b></td>
<td class="data">(35)</td>
<td class="data">$50,428,560</td>
<td class="data chart_up">+35,989%</td>
<td class="data">2,876</td>
<td class="data chart_grey">$17,534</td>
<td class="data">  $50,810,121</td>
<td class="data">23</td>
<td class="data chart_down">2</td>
<td class="data">(1)</td>
<td class="data">$18,040,373</td>
<td class="data chart_down">-29%</td>
<td class="data">3,239</td>
<td class="data chart_grey">$5,570</td>
<td class="data">  $320,631,361</td>
<td class="data">56</td>
<td class="data chart_up"><b>3</b></td>
<td class="data">(4)</td>
<td class="data">$13,190,047</td>
<td class="data chart_down">-33%</td>
<td class="data">2,521</td>
<td class="data chart_grey">$5,232</td>
<td class="data">  $82,776,598</td>
<td class="data">23</td>
<td class="data chart_up"><b>4</b></td>
<td class="data">(5)</td>
<td class="data">$12,868,263</td>
<td class="data chart_down">-26%</td>
<td class="data">2,629</td>
<td class="data chart_grey">$4,89
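The cell text shown above still carries currency signs, thousands separators, percent signs, and parentheses. A minimal sketch of turning those strings into integers follows; `parse_number` is a hypothetical helper, and treating blank cells, `(-)`, and `new` as `None` is an assumption about how they should be handled.

```python
def parse_number(text):
    """Convert a chart cell such as '$50,428,560' or '+35,989%' to an int.

    Returns None for blank cells and placeholders like '(-)' or 'new'.
    """
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    cleaned = cleaned.strip("()")  # previous-rank cells look like '(35)'
    try:
        return int(cleaned)
    except ValueError:
        return None


for cell in ["$50,428,560", "+35,989%", "-29%", "(35)", "  $50,810,121", "(-)", "new", " "]:
    print(repr(cell), parse_number(cell))
```

The percent sign is simply dropped, so the caller has to remember which columns are percentages.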

In [8]:
# table = soup.find_all("tbody")

# for rows in table:
#     print(rows)

In [9]:
table = soup.find_all("td", class_="data")

for rows in table:
    print(rows)

<td class="data chart_up"><b>1</b></td>
<td class="data">(35)</td>
<td class="data">$50,428,560</td>
<td class="data chart_up">+35,989%</td>
<td class="data">2,876</td>
<td class="data chart_grey">$17,534</td>
<td class="data">  $50,810,121</td>
<td class="data">23</td>
<td class="data chart_down">2</td>
<td class="data">(1)</td>
<td class="data">$18,040,373</td>
<td class="data chart_down">-29%</td>
<td class="data">3,239</td>
<td class="data chart_grey">$5,570</td>
<td class="data">  $320,631,361</td>
<td class="data">56</td>
<td class="data chart_up"><b>3</b></td>
<td class="data">(4)</td>
<td class="data">$13,190,047</td>
<td class="data chart_down">-33%</td>
<td class="data">2,521</td>
<td class="data chart_grey">$5,232</td>
<td class="data">  $82,776,598</td>
<td class="data">23</td>
<td class="data chart_up"><b>4</b></td>
<td class="data">(5)</td>
<td class="data">$12,868,263</td>
<td class="data chart_down">-26%</td>
<td class="data">2,629</td>
<td class="data chart_grey">$4,89

<td class="data">(52)</td>
<td class="data">$6,312</td>
<td class="data chart_down">-15%</td>
<td class="data">2</td>
<td class="data chart_grey">$3,156</td>
<td class="data">  $2,412,908</td>
<td class="data">84</td>
<td class="data">50</td>
<td class="data"><b>new</b></td>
<td class="data">$5,799</td>
<td class="data"> </td>
<td class="data">1</td>
<td class="data chart_grey">$5,799</td>
<td class="data">  $5,799</td>
<td class="data">7</td>
<td class="data chart_up"><b>51</b></td>
<td class="data">(55)</td>
<td class="data">$5,432</td>
<td class="data chart_up">+29%</td>
<td class="data">5</td>
<td class="data chart_grey">$1,086</td>
<td class="data">  $679,277</td>
<td class="data">112</td>
<td class="data">52</td>
<td class="data">(-)</td>
<td class="data">$5,403</td>
<td class="data"> </td>
<td class="data">6</td>
<td class="data chart_grey">$901</td>
<td class="data">  $136,705</td>
<td class="data">105</td>
<td class="data chart_up"><b>53</b></td>
<td class="data">(62)</td>
<td

In [24]:
table = soup.find_all("td", class_="data")

for rows in table:
    print(rows)
    
# Append every `a` tag on the page after the data cells. The links are not matched to their rows.
table += soup.find_all("a")

print("---------------------------------------")

for rows in table:
    print(rows)

<td class="data chart_up"><b>1</b></td>
<td class="data">(35)</td>
<td class="data">$50,428,560</td>
<td class="data chart_up">+35,989%</td>
<td class="data">2,876</td>
<td class="data chart_grey">$17,534</td>
<td class="data">  $50,810,121</td>
<td class="data">23</td>
<td class="data chart_down">2</td>
<td class="data">(1)</td>
<td class="data">$18,040,373</td>
<td class="data chart_down">-29%</td>
<td class="data">3,239</td>
<td class="data chart_grey">$5,570</td>
<td class="data">  $320,631,361</td>
<td class="data">56</td>
<td class="data chart_up"><b>3</b></td>
<td class="data">(4)</td>
<td class="data">$13,190,047</td>
<td class="data chart_down">-33%</td>
<td class="data">2,521</td>
<td class="data chart_grey">$5,232</td>
<td class="data">  $82,776,598</td>
<td class="data">23</td>
<td class="data chart_up"><b>4</b></td>
<td class="data">(5)</td>
<td class="data">$12,868,263</td>
<td class="data chart_down">-26%</td>
<td class="data">2,629</td>
<td class="data chart_grey">$4,89

In [52]:
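# import re
# A regex on `class` cannot also match `a` tags, because "a" is a tag name, not a class value:
# table = soup.find_all("td", {"class": re.compile("data")})
# Passing a list of tag names matches every `td` and every `a` on the page.
table = soup.find_all(["td", "a"])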

# print(table)

for rows in table:
    print(rows)

print(table[55])
print(table[-55])

<a href="http://www.the-numbers.com/"><img alt="The Numbers - Where Data and Movies Meet" border="0" height="67" src="./The Numbers - Weekly Box Office Chart for January 10th, 2014_files/the-numbers-banner.png" width="524"/>®
<br/>    Where Data and the Movie Business Meet
</a>
<a href="http://www.facebook.com/TheNumbers" target="_blank"><img height="32" src="./The Numbers - Weekly Box Office Chart for January 10th, 2014_files/facebook.png" style="border:none;" title="Follow The Numbers on Facebook" width="32"/></a>
<a href="http://www.twitter.com/MovieNumbers" target="_blank" title="Follow The Numbers on Twitter"><img height="32" src="./The Numbers - Weekly Box Office Chart for January 10th, 2014_files/twitter.png" style="border:none;" width="32"/></a>
<a href="http://www.the-numbers.com/news">News</a>
<a href="http://www.the-numbers.com/news">Latest News</a>
<a href="http://www.the-numbers.com/movies/release-schedule">Release Schedule</a>
<a href="http://www.the-numbers.com/on-this-d
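Because `find_all(["td", "a"])` returns every link on the page, a narrower filter may help. `find_all` also accepts a function that receives each tag and returns `True` or `False`. The sketch below keeps `td` cells with the class `data` plus any `a` tag nested inside a `td`; the sample markup and the name `chart_cell_or_link` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample: one site navigation link followed by one chart row.
html = """
<a href="http://www.the-numbers.com/news">News</a>
<table><tr>
<td class="data chart_up"><b>1</b></td>
<td><b><a href="http://www.the-numbers.com/movie/Lone-Survivor#tab=box-office">Lone Survivor</a></b></td>
<td><a href="http://www.the-numbers.com/market/distributor/Universal">Universal</a></td>
<td class="data">$50,428,560</td>
</tr></table>
"""
sample = BeautifulSoup(html, "html.parser")


def chart_cell_or_link(tag):
    """Keep `td` cells with class 'data' and `a` tags nested inside any `td`."""
    if tag.name == "td":
        return "data" in tag.get("class", [])
    if tag.name == "a":
        return tag.find_parent("td") is not None
    return False


matches = sample.find_all(chart_cell_or_link)
texts = [tag.get_text(strip=True) for tag in matches]
print(texts)
```

On the real page this filter would still match the previous/next chart links, since those also sit inside `td` cells, so it is a partial fix at best.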

In [54]:
table = soup.find_all("tbody")

for rows in table:
    print(rows)

<tbody><tr>
<td class="previous"><a href="http://www.the-numbers.com/box-office-chart/weekly/2014/01/03">← Previous Chart</a></td>
<td class="index"><a href="http://www.the-numbers.com/box-office">Chart Index</a></td>
<td class="next"><a href="http://www.the-numbers.com/box-office-chart/weekly/2014/01/17">Next Chart →</a></td>
</tr>
</tbody>
<tbody><tr><th> </th><th> </th><th>Movie</th><th>Distributor</th><th>Gross</th><th>Change</th><th>Thtrs.</th><th>Per Thtr.</th><th>Total Gross</th><th>Days</th></tr>
<tr>
<td class="data chart_up"><b>1</b></td>
<td class="data">(35)</td>
<td><b><a href="http://www.the-numbers.com/movie/Lone-Survivor#tab=box-office">Lone Survivor</a></b></td>
<td><a href="http://www.the-numbers.com/market/distributor/Universal">Universal</a></td>
<td class="data">$50,428,560</td>
<td class="data chart_up">+35,989%</td>
<td class="data">2,876</td>
<td class="data chart_grey">$17,534</td>
<td class="data">  $50,810,121</td>
<td class="data">23</td>
</tr>
<tr bgcolor="

In [53]:
# `infile` was already closed when the `with` block that opened it exited, so no explicit close is needed.
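The `tbody` output above shows two bodies: a navigation table and the chart itself, whose header row contains a `Movie` column. The `csv` module is imported at the top of the notebook but never used. The sketch below, under the assumption that the `Movie` header reliably identifies the chart, picks that `tbody` and writes its rows as CSV. It writes to an `io.StringIO` stand-in rather than a real file, and the sample markup is shortened.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened sample: a navigation table followed by the chart table.
html = """
<table><tbody><tr>
<td class="previous"><a href="http://www.the-numbers.com/box-office-chart/weekly/2014/01/03">Previous Chart</a></td>
</tr></tbody></table>
<table><tbody>
<tr><th> </th><th> </th><th>Movie</th><th>Distributor</th><th>Gross</th></tr>
<tr>
<td class="data chart_up"><b>1</b></td>
<td class="data">(35)</td>
<td><b><a href="http://www.the-numbers.com/movie/Lone-Survivor#tab=box-office">Lone Survivor</a></b></td>
<td><a href="http://www.the-numbers.com/market/distributor/Universal">Universal</a></td>
<td class="data">$50,428,560</td>
</tr>
</tbody></table>
"""
sample = BeautifulSoup(html, "html.parser")

# Pick the first `tbody` whose header cells include a "Movie" column.
chart = None
headers = []
for tbody in sample.find_all("tbody"):
    headers = [th.get_text(strip=True) for th in tbody.find_all("th")]
    if "Movie" in headers:
        chart = tbody
        break
if chart is None:
    raise ValueError("no chart table found")

out = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.writer(out)
writer.writerow(headers)
for tr in chart.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no `td` cells
        writer.writerow(cells)

print(out.getvalue())
```

The `csv` writer quotes the gross figure automatically because it contains commas.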