## Part II: Foreign and Daily Box Office

_That first notebook was getting out of hand, and it also took 40 minutes to run. So let's migrate to a new notebook!_

Now that we have the Movies table, we need to produce and populate the DailyBoxOffice and ForeignBoxOffice tables. Let's define some useful functions and connect to our database.

In [1]:
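#Connection to database:
import sys

import MySQLdb as mdb
import pandas
import requests
from lxml import html   # html.fromstring is used by the scraping helpers below

con = mdb.connect(host = 'localhost', 
                  user = 'root', 
                  passwd = 'dwdstudent2015', 
                  charset='utf8', use_unicode=True)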

These functions originated in the other notebook. They expedite repetitious processes like querying the database, getting HTML pages, and, of course, finding the foreign/daily data.

In [2]:
def GetHTML(URL):
    # Fetch a page and parse it into an lxml element tree
    return html.fromstring(requests.get(URL, stream=True).text)

def SQLquery_df(query):
    # Run a query and return the result set as a DataFrame
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute(query)
    rows = cur.fetchall()
    df_from_sql = pandas.DataFrame(list(rows))
    return df_from_sql

def SQLquery_raw(query):
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute(query)
    rows = cur.fetchall()
    return rows

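def CleanNumber(n):
    # "$1,234" -> 1234.0; anything unparseable (including None) -> None
    try:
        return float((n.strip("$")).replace(",",""))
    except (AttributeError, TypeError, ValueError):
        return None
    
def CleanPercent(n):
    # "-8.7%" -> -0.087; anything unparseable (including None) -> None
    try:
        return (float((n.strip("%")).replace(",",""))/100)
    except (AttributeError, TypeError, ValueError):
        return None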

def GetDailyBoxOffice(ID):
    url = 'http://www.boxofficemojo.com/movies/?page=daily&view=chart&id='+ID+'.htm'
    page = GetHTML(url)
    table_xp = '//*[contains(concat( " ", @class, " " ), concat( " ", "chart-wide", " " ))]//td'
    table = [i.text_content() for i in page.xpath(table_xp)]
    table = table[8:len(table)]
    table = [i for i in table if i != '']
    rows = []
    for i in range(0,len(table),10):
        row = table[i:(i+10)]
        rows.append(row)
    frame = pandas.DataFrame(rows)
    frame.rename(columns={
        0:"Day_Week",
        1:"Full_Date",
        2:"Rank",
        3:"Daily_BoxOffice",
        4:"Delta_Yesterday",
        5:"Delta_LastWeek",
        6:'Theaters',
        7:'Theaters_Avg',
        8:'Gross_ToDate',
        9:"Days_Out"
    }, inplace = True)
    try:
        frame["Gross_ToDate"] = frame["Gross_ToDate"].apply(CleanNumber)
        frame["Daily_BoxOffice"] = frame["Daily_BoxOffice"].apply(CleanNumber)
        frame["Theaters"] = frame["Theaters"].apply(CleanNumber)
        frame["Full_Date"] = pandas.to_datetime(frame["Full_Date"])
        frame["Theaters_Avg"] = frame["Theaters_Avg"].apply(CleanNumber)
        frame["Delta_Yesterday"] = frame["Delta_Yesterday"].apply(CleanPercent)
        frame["Delta_LastWeek"] = frame["Delta_LastWeek"].apply(CleanPercent)
    except KeyError:
        return frame
    return frame

def GetForeign(ID):      
    referer_url = 'http://www.boxofficemojo.com/movies/?page=intl&view=bycountry&id='+ID+'.htm'
    # Note: passed via params, so this goes out as a query-string parameter, not an HTTP header
    parameters = {"Referer": referer_url}
    # Fetch the by-country page once and parse the response
    response = requests.get(url=referer_url, params=parameters)
    page = html.fromstring(response.text)
    table_xp = '//b//a | //td//td//font'   #bingo
    table = page.xpath(table_xp)
    table = [i.text_content() for i in table]
    table = table[21:]
    table = [i for i in table if i != '']

    # Step through the cells 8 at a time, keeping the first 7 of each group
    rows = []
    for i in range(0,len(table),8):
        row = table[i:(i+7)]
        rows.append(row)

    frame = pandas.DataFrame(rows)
    frame.rename(columns={
        0:"Country",
        1:"Distributor",
        2:"Release_Date",
        3:"Opening_Weekend",
        4:"Opening_Percent_Total",
        5:"C_BoxOffice",
        6:"As_Of"
    }, inplace = True)
    try:
        frame["Opening_Weekend"] = frame["Opening_Weekend"].apply(CleanNumber)
        frame["Opening_Percent_Total"] = frame["Opening_Percent_Total"].apply(CleanPercent)
        frame["C_BoxOffice"] = frame["C_BoxOffice"].apply(CleanNumber)
        frame["Release_Date"] = pandas.to_datetime(frame["Release_Date"])
    except KeyError:
        return frame
    return frame
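
To make the cleaning and row-building steps inside `GetDailyBoxOffice` and `GetForeign` concrete, here is a minimal standalone sketch. The helper names `clean_number`, `clean_percent`, and `chunk`, and the sample cell values, are made up for illustration; only the parsing logic mirrors the functions above.

```python
def clean_number(text):
    """Turn a string like '$9,669,724' into a float; return None if it can't be parsed."""
    try:
        return float(text.strip("$").replace(",", ""))
    except (AttributeError, ValueError):
        return None

def clean_percent(text):
    """Turn a string like '-8.7%' into a fraction; return None if it can't be parsed."""
    try:
        return float(text.strip("%").replace(",", "")) / 100
    except (AttributeError, ValueError):
        return None

def chunk(cells, stride, keep=None):
    """Split a flat list into groups of `stride` cells, keeping the first `keep` of each group."""
    keep = stride if keep is None else keep
    return [cells[i:i + keep] for i in range(0, len(cells), stride)]

# Hypothetical flat cell list, shaped like what an XPath query over a table returns
cells = ["Fri", "$9,669,724", "-", "Sat", "$8,832,769", "-8.7%"]
rows = chunk(cells, 3)
parsed = [(day, clean_number(gross), clean_percent(delta)) for day, gross, delta in rows]
print(parsed)
```

A placeholder cell such as `"-"` cannot be converted to a float, so it becomes `None`, which is why the first days of a release show missing deltas. Calling `chunk(cells, 8, keep=7)` reproduces the stride-8, keep-7 pattern used for the foreign table.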

Let's query the database and run a few tests to make sure that our functions output proper, clean dataframes.

In [3]:
t_movies = SQLquery_df('''SELECT * FROM Movies.Movies''')

In [4]:
GetDailyBoxOffice(t_movies["BoxOfficeID"][1453])

Unnamed: 0,Day_Week,Full_Date,Rank,Daily_BoxOffice,Delta_Yesterday,Delta_LastWeek,Theaters,Theaters_Avg,Gross_ToDate,Days_Out
0,Fri,2009-03-27,2,9669724.0,,,2732.0,3539.0,9669724.0,1
1,Sat,2009-03-28,2,8832769.0,-0.087,,2732.0,3233.0,18502493.0,2
2,Sun,2009-03-29,2,4502272.0,-0.49,,2732.0,1648.0,23004765.0,3
3,Mon,2009-03-30,2,1346755.0,-0.701,,2732.0,493.0,24351520.0,4
4,Tue,2009-03-31,2,1278037.0,-0.051,,2732.0,468.0,25629557.0,5
5,Wed,2009-04-01,2,1059840.0,-0.171,,2732.0,388.0,26689397.0,6
6,Thu,2009-04-02,3,1000236.0,-0.056,,2732.0,366.0,27689633.0,7
7,Fri,2009-04-03,3,3749607.0,2.749,-0.612,2732.0,1372.0,31439240.0,8
8,Sat,2009-04-04,3,3834031.0,0.023,-0.566,2732.0,1403.0,35273271.0,9
9,Sun,2009-04-05,5,1898009.0,-0.505,-0.578,2732.0,695.0,37171280.0,10


In [5]:
GetForeign(t_movies["BoxOfficeID"][234])

Unnamed: 0,Country,Distributor,Release_Date,Opening_Weekend,Opening_Percent_Total,C_BoxOffice,As_Of
0,Argentina,Fox,2004-09-16,193070.0,0.327,590711.0,Final
1,Australia,Fox,2004-09-30,2848526.0,0.632,4505257.0,Final
2,Austria,Centfox,2004-11-05,,,877824.0,Final
3,Belgium,-,2004-11-03,,,1155642.0,Final
4,Bolivia,-,2004-09-30,,,56259.0,Final
5,Brazil,-,2004-09-03,,,1932995.0,Final
6,Bulgaria,Alexandra,2004-11-12,49121.0,0.317,155116.0,Final
7,Central America and the Greater Antilles,-,2004-09-17,,,673658.0,Final
8,Chile,-,2004-09-09,,,444845.0,Final
9,Colombia,-,2004-09-17,,,568436.0,Final


### Error Handling, Assessment
Although these functions work most of the time, they sometimes fail, and we need to know how often and why they fail in order to handle the failures properly. There are four different types of errors:
- **KeyErrors** occur when there is no valid BoxOfficeID
- **ValueErrors** occur when there are no foreign box office stats available for the movie (i.e. when there was no international release, which is not uncommon)
- **TypeErrors** occur when the movie has no BoxOfficeID at all in the main table (when it's a NoneType)
- **ChunkedEncodingErrors** occur when we keep the connection open for too long

The first three are easy enough to account for, but the fourth will require some strategic looping (and luck).
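
One way to approach the "strategic looping" for the fourth error type is a small retry wrapper. The sketch below is an assumption about how this could be done, not code from this notebook: `retry_call` and `flaky_fetch` are hypothetical names, and in the notebook the exception to retry on would be `requests.exceptions.ChunkedEncodingError` rather than the built-in `ConnectionError` used here to keep the example free of network calls.

```python
import time

def retry_call(func, *args, retry_on=(ConnectionError,), attempts=3, wait_seconds=0.0):
    """Call func(*args), retrying when one of `retry_on` is raised.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(wait_seconds)  # brief pause before trying again

# Stand-in for a scraper call that drops the connection twice, then succeeds
calls = {"count": 0}

def flaky_fetch(movie_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection dropped")
    return "page for " + movie_id

result = retry_call(flaky_fetch, "10000bc")
print(result, "after", calls["count"], "attempts")
```

Capping the number of attempts keeps a permanently broken page from stalling the whole scrape; the final failure is re-raised so the caller can decide whether to skip that movie.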

Running the foreign box office function, we get 644 errors out of 3,700 (~17%), which really isn't bad: many of the movies may well not have been released in any foreign market, in which case the error is completely legitimate. The daily box office function, on the other hand, appears to work 100% of the time!

These cells of code help us assess how often both functions fail and why, as well as how long the scraping process takes, **but you do not have to run them!**

In [6]:
import datetime

start = datetime.datetime.now()
error_count = 0
for i in range(len(t_movies)):
    try:
        ugh = GetForeign(t_movies["BoxOfficeID"][i])
        #print(str(i)+"\t"+t_movies["BoxOfficeID"][i]+"\t"+ugh["C_Total_Gross"][0])
    except KeyError:
        #print(str(i)+"\t"+t_movies["BoxOfficeID"][i]+"\t"+"KeyError")
        #Key errors occur when there is no valid BoxOfficeID
        error_count = error_count + 1
    except ValueError:
        #print(str(i)+"\t"+t_movies["BoxOfficeID"][i]+"\t"+"ValueError")       
        #Value errors occur when there are no foreign box office stats available for the movie
        #(i.e. when there was no international release)
        error_count = error_count + 1
    except TypeError:
        #print(str(i)+"\t"+"No ID!"+"\t"+"TypeError") 
        #TypeErrors occur when there is no valid boxofficeID for movie in the main table. (When it's a NoneType)
        error_count = error_count + 1

end = datetime.datetime.now()
print("ERROR COUNT: " + str(error_count))
print("TIME ELAPSED: " + str(end - start))

ERROR COUNT:644
TIME ELAPSED:0:07:22.744720


In [7]:
import datetime

start = datetime.datetime.now()
error_count = 0
for i in range(len(t_movies)):
    try:
        foo = GetDailyBoxOffice(t_movies["BoxOfficeID"][i])
        #print(str(i)+"\t"+t_movies["BoxOfficeID"][i]+"\t"+ugh["Full_Date"][0])
    except KeyError:
        #print(str(i)+"\t"+t_movies["BoxOfficeID"][i]+"\t"+"KeyError")
        error_count += 1
    except ValueError:
        #print(str(i)+"\t"+t_movies["BoxOfficeID"][i]+"\t"+"ValueError")  
        error_count += 1
    except TypeError:
        #print(str(i)+"\t"+"No ID!"+"\t"+"TypeError") 
        error_count += 1
        
end = datetime.datetime.now()
print("ERROR COUNT: " + str(error_count))
print("TIME ELAPSED: " + str(end - start))

ERROR COUNT:0
TIME ELAPSED:0:04:37.324868
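
The loops above keep only a single total. To see why the functions fail, and not just how often, the counts could be broken down by exception type. A minimal sketch, where `fake_scrape` and its IDs are hypothetical stand-ins for `GetForeign` and the `BoxOfficeID` column:

```python
from collections import Counter

def fake_scrape(movie_id):
    """Hypothetical stand-in for GetForeign: fails in different ways depending on the ID."""
    if movie_id is None:
        raise TypeError("no BoxOfficeID")        # missing ID in the main table
    if movie_id == "nointl":
        raise ValueError("no foreign release")   # no international data
    return {"id": movie_id}

def tally_errors(scrape, ids):
    """Run `scrape` over `ids` and count failures by exception class name."""
    errors = Counter()
    for movie_id in ids:
        try:
            scrape(movie_id)
        except (KeyError, ValueError, TypeError) as exc:
            errors[type(exc).__name__] += 1
    return errors

errors = tally_errors(fake_scrape, ["10000bc", None, "nointl", "nointl"])
print(dict(errors))
print("ERROR COUNT:", sum(errors.values()))
```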


### Databasing
We can decide how to better handle foreign box office errors (if at all) later, but we can ignore most of them because they either 1) represent legitimately missing data or 2) occur infrequently enough that it's not worth changing our functions and XPaths to account for them. 

So let's move on to databasing.

In [14]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'DailyBoxOffice'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (BoxOfficeID varchar(250),
                        Day_Week varchar(250),
                        Full_Date datetime,
                        Rank int,
                        Daily_BoxOffice float,
                        Delta_Yesterday float,
                        Delta_LastWeek float,
                        Theaters int,
                        Theaters_Avg float,
                        Gross_ToDate float,
                        Days_Out int,
                        PRIMARY KEY(BoxOfficeID, Days_Out),
                        FOREIGN KEY(BoxOfficeID) REFERENCES Movies.Movies(BoxOfficeID)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

As with so many cells in these notebooks, this might take a long time to run.

(If you get disconnected or receive a **ChunkedEncodingError**, just try again and hope for the best.)

In [15]:
cursor = con.cursor()
table_name = 'DailyBoxOffice'
db_name = 'Movies'

insert_query_template = '''INSERT IGNORE INTO {db}.{table}(BoxOfficeID,
                        Day_Week,
                        Full_Date,
                        Rank,
                        Daily_BoxOffice,
                        Delta_Yesterday,
                        Delta_LastWeek,
                        Theaters,
                        Theaters_Avg,
                        Gross_ToDate,
                        Days_Out)
                        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''.format(db=db_name, table=table_name)

for j in range(0,len(t_movies)):
    frame = GetDailyBoxOffice(t_movies["BoxOfficeID"][j])
    for i in range(0,len(frame)):
        query_parameters = (t_movies["BoxOfficeID"][j],
                            frame["Day_Week"][i],
                            frame["Full_Date"][i],
                            frame["Rank"][i],
                            frame["Daily_BoxOffice"][i],
                            frame["Delta_Yesterday"][i],
                            frame["Delta_LastWeek"][i],
                            frame["Theaters"][i],
                            frame["Theaters_Avg"][i],
                            frame["Gross_ToDate"][i],
                            frame["Days_Out"][i])
        cursor.execute(insert_query_template,query_parameters)
con.commit()
cursor.close()
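
Because the insert uses `INSERT IGNORE` against the composite primary key `(BoxOfficeID, Days_Out)`, re-running this cell after a dropped connection should not create duplicate rows. The sketch below illustrates that idea with Python's built-in `sqlite3` as a stand-in for MySQL. Note that SQLite spells the statement `INSERT OR IGNORE`, and that the table here is trimmed to three columns with made-up values.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL Movies database
con_demo = sqlite3.connect(":memory:")
con_demo.execute("""CREATE TABLE DailyBoxOffice
                    (BoxOfficeID TEXT,
                     Days_Out INTEGER,
                     Daily_BoxOffice REAL,
                     PRIMARY KEY (BoxOfficeID, Days_Out))""")

# Made-up rows for a hypothetical movie ID
rows = [("examplefilm", 1, 100.0),
        ("examplefilm", 2, 80.0)]

insert_sql = "INSERT OR IGNORE INTO DailyBoxOffice VALUES (?, ?, ?)"

# First run, then a simulated re-run after a failure
con_demo.executemany(insert_sql, rows)
con_demo.executemany(insert_sql, rows)
con_demo.commit()

row_count = con_demo.execute("SELECT COUNT(*) FROM DailyBoxOffice").fetchone()[0]
print(row_count)
```

The second batch collides with the primary key and is skipped, so the table still holds two rows. The trade-off is that ignored rows are dropped silently, so a genuinely different value for the same key would not overwrite the stored one.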



In [16]:
SQLquery_df('''SELECT * FROM Movies.DailyBoxOffice''')

Unnamed: 0,BoxOfficeID,Daily_BoxOffice,Day_Week,Days_Out,Delta_LastWeek,Delta_Yesterday,Full_Date,Gross_ToDate,Rank,Theaters,Theaters_Avg
0,10000bc,12506900.0,Fri,1,0.000,0.000,2008-03-07,12506900.0,1,3410,3668.0
1,10000bc,14016200.0,Sat,2,0.000,0.121,2008-03-08,26523100.0,1,3410,4110.0
2,10000bc,9344390.0,Sun,3,0.000,-0.333,2008-03-09,35867500.0,1,3410,2740.0
3,10000bc,2684500.0,Mon,4,0.000,-0.713,2008-03-10,38552000.0,1,3410,787.0
4,10000bc,2317280.0,Tue,5,0.000,-0.137,2008-03-11,40869300.0,1,3410,680.0
5,10000bc,1988380.0,Wed,6,0.000,-0.142,2008-03-12,42857600.0,1,3410,583.0
6,10000bc,1946470.0,Thu,7,0.000,-0.021,2008-03-13,44804100.0,1,3410,571.0
7,10000bc,4945910.0,Fri,8,-0.605,1.541,2008-03-14,49750000.0,2,3410,1450.0
8,10000bc,7006290.0,Sat,9,-0.500,0.417,2008-03-15,56756300.0,2,3410,2055.0
9,10000bc,4821110.0,Sun,10,-0.484,-0.312,2008-03-16,61577400.0,2,3410,1414.0


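Looks good, with two caveats visible in the output: the missing deltas come back as 0.000 rather than as missing values, and the large dollar amounts appear rounded to about six significant digits, which is consistent with storing them in single-precision `float` columns. Now let's do foreign.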

In [18]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'ForeignBoxOffice'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (BoxOfficeID varchar(250),
                        Country varchar(250),
                        Distributor varchar(250),
                        Release_Date datetime,
                        Opening_Weekend float,
                        Opening_Percent_Total float,
                        C_BoxOffice float,
                        As_Of varchar(250),
                        PRIMARY KEY(BoxOfficeID, Country),
                        FOREIGN KEY(BoxOfficeID) REFERENCES Movies.Movies(BoxOfficeID)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

In [19]:
cursor = con.cursor()
table_name = 'ForeignBoxOffice'
db_name = 'Movies'

insert_query_template = '''INSERT IGNORE INTO {db}.{table}(BoxOfficeID,
                        Country,
                        Distributor,
                        Release_Date,
                        Opening_Weekend,
                        Opening_Percent_Total,
                        C_BoxOffice,
                        As_Of)
                        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''.format(db=db_name, table=table_name)

for j in range(0,len(t_movies)):
    try:
        frame = GetForeign(t_movies["BoxOfficeID"][j])
        for i in range(0,len(frame)):
            query_parameters = (t_movies["BoxOfficeID"][j],
                       frame["Country"][i],
                       frame["Distributor"][i],
                       frame["Release_Date"][i],
                       frame["Opening_Weekend"][i],
                       frame["Opening_Percent_Total"][i],
                       frame["C_BoxOffice"][i],
                       frame["As_Of"][i])
            cursor.execute(insert_query_template,query_parameters)
    except Exception:
        # Skip movies whose foreign data can't be scraped or inserted
        continue
con.commit()
cursor.close()



In [20]:
SQLquery_df('''SELECT * FROM Movies.ForeignBoxOffice''')

Unnamed: 0,As_Of,BoxOfficeID,C_BoxOffice,Country,Distributor,Opening_Percent_Total,Opening_Weekend,Release_Date
0,6/16/08,10000bc,2262480.0,Argentina,WB,0.218,492451.0,2008-03-06
1,5/25/08,10000bc,6276220.0,Australia,WB,0.328,2058400.0,2008-03-06
2,5/4/08,10000bc,1363340.0,Austria,WB,0.319,434353.0,2008-03-07
3,5/18/08,10000bc,2442140.0,Belgium,WB,0.253,617996.0,2008-03-12
4,5/11/08,10000bc,178173.0,Bolivia,-,0.363,64607.0,2008-03-20
5,6/8/08,10000bc,5918490.0,Brazil,WB,0.278,1646220.0,2008-03-07
6,5/11/08,10000bc,351361.0,Bulgaria,Alexandra,0.461,162012.0,2008-03-14
7,5/4/08,10000bc,178212.0,Central America,-,0.000,0.0,2008-03-14
8,5/11/08,10000bc,1222720.0,Chile,WB,0.253,309515.0,2008-03-06
9,4/20/08,10000bc,10851100.0,China,China Film,0.330,3582300.0,2008-03-21


There you have it! Next up: the IMDB data.