## Part III: IMDB Data
Now that we have all of the BoxOfficeMojo data scraped and organized in a database, it's time to add IMDB data on actors, directors, genres, critical scores, and more. To do this, we'll use the [OMDB API](http://www.omdbapi.com/), which spares us much of the lxml scraping we might otherwise have had to do.

As always, let's start by defining some useful functions and connecting to the database.

In [1]:
import requests
from lxml import html 
import pandas
import MySQLdb as mdb
import sys

con = mdb.connect(host = 'localhost', 
                  user = 'root', 
                  passwd = 'dwdstudent2015', 
                  charset='utf8', use_unicode=True);

In [2]:
def GetHTML(URL):
    return html.fromstring((requests.get(URL,stream=True)).text,)

def SQLquery_df(query):
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute(query)
    rows = cur.fetchall()
    df_from_sql = pandas.DataFrame(list(rows))
    return df_from_sql

def SQLquery_raw(query):
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute(query)
    rows = cur.fetchall()
    return rows

def GetIMDB_Data(ID):
    omdb_url = 'http://www.omdbapi.com/?'
    parameters = {'i':ID}
    return requests.get(url=omdb_url,params=parameters).json()

Now let's assess the JSON data that the OMDB API outputs for a given movie in the database.

In [3]:
t_movies = SQLquery_df('''SELECT * FROM Movies.Movies''')
imdb_id = t_movies["IMDB_ID"][234]

omdb_url = 'http://www.omdbapi.com/?'
parameters = {'i':imdb_id}

requests.get(url=omdb_url,params=parameters).json()

{'Actors': 'Sanaa Lathan, Raoul Bova, Lance Henriksen, Ewen Bremner',
 'Awards': '2 wins & 4 nominations.',
 'BoxOffice': '$80,218,314.00',
 'Country': 'USA, UK, Czech Republic, Canada, Germany',
 'DVD': '25 Jan 2005',
 'Director': 'Paul W.S. Anderson',
 'Genre': 'Action, Horror, Sci-Fi',
 'Language': 'English, Italian',
 'Metascore': '29',
 'Plot': 'During an archaeological expedition on Bouvetøya Island in Antarctica, a team of archaeologists and other scientists find themselves caught up in a battle between the two legends. Soon, the team realize that only one species can win.',
 'Poster': 'https://images-na.ssl-images-amazon.com/images/M/MV5BMTU4MjIwMTcyMl5BMl5BanBnXkFtZTYwMTYwNDA3._V1_SX300.jpg',
 'Production': '20th Century Fox',
 'Rated': 'PG-13',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '5.6/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '21%'},
  {'Source': 'Metacritic', 'Value': '29/100'}],
 'Released': '13 Aug 2004',
 'Response': 'True',
 'Runtime': '101 m

That's a lot of fields!
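
Note that the critic scores arrive as a list of `Source`/`Value` dictionaries under `Ratings`, rather than as top-level fields. Here's a minimal sketch of flattening that list into a lookup by source, using the values from the response above. The helper name `ratings_by_source` is hypothetical and not part of the notebook's own functions.

```python
def ratings_by_source(payload):
    """Map each rating source to its raw string value."""
    return {r["Source"]: r["Value"] for r in payload.get("Ratings", [])}

# Values copied from the sample response above.
sample = {
    "Metascore": "29",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "5.6/10"},
        {"Source": "Rotten Tomatoes", "Value": "21%"},
        {"Source": "Metacritic", "Value": "29/100"},
    ],
}

ratings = ratings_by_source(sample)
print(ratings["Rotten Tomatoes"])  # 21%
```

The values are still raw strings at this point, so they need the same kind of cleaning as the other fields.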

Now, ideally, the following fields would've been included in the _original_ movie table, since they are specific to individual movies. (There's a one-to-one relationship between IMDB data and basic BoxOfficeMojo data for movies).
- Runtime
- IMDB Rating
- RottenTomatoes Rating
- Metacritic Rating
- Language
- Directors

However, because of the nature of our database's design, it would be too cumbersome to consolidate all of those fields into a single table - it would take hours and hours to scrape BoxOfficeMojo and then query the OMDB API for each of the 3700 movies. It's just too inefficient programmatically.

So instead, we'll make another table - the **Movies_IMDB table** - which will have a one-to-one relationship with the original Movies table. The tables will be joined on both BoxOfficeID and IMDB_ID. 

There are two important fields that we'll want to store in other separate, many-to-many tables:
- Genre
- Actors

(**We're gonna hold off on awards and writers because those data aren't structured in a systematic, organized way.**)

We'll get to the many-to-many tables later, but first let's get all the extra IMDB data that we need, put it in a dataframe, clean it, then move it to SQL. (This may seem like a circuitous process, but separating the IMDB querying and SQL insertion makes it easier to handle errors and clean all the data at once).

As with last time, we'll make sure to keep track of duration and errors.

In [4]:
import datetime

t_movies = SQLquery_raw(''' SELECT BoxOfficeID, IMDB_ID FROM Movies.Movies''')

start = datetime.datetime.now()

data = []
pass_count = 0

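for movie in t_movies:
    #Skip movies that never got an IMDB ID; there is nothing to look up.
    if movie['IMDB_ID'] is None:
        continue
    try:
        imdb_data = GetIMDB_Data(movie['IMDB_ID'])
        imdb_data.update({'BoxOfficeID':movie['BoxOfficeID']})
        data.append(imdb_data)
        #print(imdb_data)
    except Exception:
        #Count any failed request or unparseable response as a pass.
        pass_count += 1
        #print("PASS")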
    
end = datetime.datetime.now()

print("Pass Count "+str(pass_count))
print("Successful Entries "+str(len(data)))
#Error rate = failures / total attempts (failures + successes)
print("Error Rate "+str(round(pass_count/(pass_count+len(data)),2)))
print("Time Elapsed "+str(end-start))

Pass Count 277
Successful Entries 3407
Error Rate 0.08
Time Elapsed 0:00:49.726834


So we had an 8% error rate which, once again, is probably low enough that we can ignore the errors, rather than restructuring the function altogether. It's also a relatively quick process, which is good! 
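
One caveat about these counts: the `try`/`except` only catches requests that raise an exception. The successful example earlier shows `'Response': 'True'`, and as far as we know OMDB reports a failed lookup as an ordinary JSON body with `'Response': 'False'`, so such payloads would be counted as successes. Here's a sketch of a stricter check. `is_successful` is a hypothetical helper, and the `Error` key in the sample is an assumption about what a failure payload looks like.

```python
def is_successful(payload):
    """Return True only for a dict payload that reports Response == 'True'."""
    return isinstance(payload, dict) and payload.get("Response") == "True"

good = {"Title": "Example", "Response": "True"}
bad = {"Response": "False", "Error": "Movie not found!"}  # assumed failure shape

kept = [p for p in (good, bad) if is_successful(p)]
print(len(kept))  # 1
```

Filtering on this before appending to `data` would keep empty records out of the dataframe.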

Let's move the data to a dataframe and begin cleaning it as necessary.

In [5]:
df = pandas.DataFrame(data)
#df

In [6]:
#Some familiar functions for cleaning:
def CleanNumber(n):
    try:
        return float((n.strip("$")).replace(",",""))
    except:
        return None
    
def CleanPercent(n):
    try:
        return (float((n.strip("%")).replace(",",""))/100)
    except:
        return None
    
def CleanRunTime(n):
    try:
        return (float(n.strip(" min")))
    except:
        return None
    
def FixNA(n):
    if n == 'N/A':
        return None
    else:
        return n
    
#Some functions we'll need for extracting ratings:
   
    #I made the Metacritic function on the off chance that some data stored in the "Ratings" list...
    #... was not reflected in the "Metascore" field. This ended up not being the case, so we no longer need the function.
    
def GetMetacritic(ratings):
    for rating in ratings:
        try:
            if rating["Source"] == "Metacritic":
                #Take the part before the slash; strip('/100') would also eat digits, e.g. '71/100' -> '7'.
                return int(rating["Value"].split('/')[0])
        except Exception:
            return None
        
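def GetRottenTomatoes(ratings):
    for rating in ratings:
        if rating["Source"] == "Rotten Tomatoes":
            #Round before casting: 0.29*100 is 28.999999999999996 in floating point, which int() would truncate to 28.
            return int(round(CleanPercent(rating["Value"])*100))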
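
A note on parsing values like `'29/100'` or `'101 min'`: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal suffix. That works for `'101 min'`, but it can eat digits that belong to the number. Here's a sketch of a split-based alternative; `leading_number` is a hypothetical helper, not one of the notebook's functions.

```python
def leading_number(text):
    """Parse the number before the first '/', '%' or space, or return None."""
    try:
        token = text.replace("/", " ").replace("%", " ").split()[0]
        return float(token.replace(",", ""))
    except (AttributeError, IndexError, ValueError):
        return None

print("71/100".strip("/100"))     # 7  (characters removed, not a suffix)
print(leading_number("71/100"))   # 71.0
print(leading_number("101 min"))  # 101.0
print(leading_number("N/A"))      # None
```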

In [7]:
#We'll start by dropping a few fields that we don't find useful, or already have in other tables.
    #Don't have to do this, just makes it easier to review the data.
    
df.drop(["DVD","Response","Released","Language","Year","Type","Country"], axis=1, inplace=True)

#And now, cleaning:
df["BoxOffice"] = df["BoxOffice"].apply(CleanNumber)
df["imdbVotes"] = df["imdbVotes"].apply(CleanNumber)
df["Runtime"] = df["Runtime"].apply(CleanRunTime)
df["RottenTomatoes"] = df["Ratings"].apply(GetRottenTomatoes)
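#astype(..., raise_on_error=False) is not available in current pandas. Mapping 'N/A' to None is enough here:
#MySQL stores None as NULL and casts the remaining numeric strings on insert.
df["Metascore"] = df["Metascore"].apply(FixNA)
df["imdbRating"] = df["imdbRating"].apply(FixNA)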
df

Unnamed: 0,Actors,Awards,BoxOffice,BoxOfficeID,Director,Genre,Metascore,Plot,Poster,Production,Rated,Ratings,Runtime,Title,Website,Writer,imdbID,imdbRating,imdbVotes,RottenTomatoes
0,"James McAvoy, Michael Fassbender, Jennifer Law...",13 nominations.,135729385.0,,Bryan Singer,"Action, Adventure, Sci-Fi",52,After the re-emergence of the world's first mu...,https://images-na.ssl-images-amazon.com/images...,20th Century Fox,PG-13,"[{'Value': '7.1/10', 'Source': 'Internet Movie...",144.0,X-Men: Apocalypse,https://www.facebook.com/xmenmovies,"Simon Kinberg (screenplay), Bryan Singer (stor...",tt3385516,7.1,264668.0,48.0
1,"Steven Strait, Camilla Belle, Cliff Curtis, Jo...",,94700000.0,10000bc,Roland Emmerich,"Action, Adventure, Drama",34,A prehistoric epic that follows a young mammot...,https://images-na.ssl-images-amazon.com/images...,Warner Bros. Pictures,PG-13,"[{'Value': '5.1/10', 'Source': 'Internet Movie...",109.0,"10,000 BC",http://www.10000bcmovie.com/,"Roland Emmerich, Harald Kloser",tt0443649,5.1,112045.0,8.0
2,"Jaden Smith, Will Smith, Sophie Okonedo, Zoë K...",3 wins & 8 nominations.,60522097.0,1000ae,M. Night Shyamalan,"Action, Adventure, Sci-Fi",33,A crash landing leaves Kitai Raige and his fat...,https://images-na.ssl-images-amazon.com/images...,Sony Pictures,PG-13,"[{'Value': '4.9/10', 'Source': 'Internet Movie...",100.0,After Earth,http://afterearth.com/,"Gary Whitta (screenplay), M. Night Shyamalan (...",tt1815862,4.9,165926.0,11.0
3,"Helen Mirren, Om Puri, Manish Dayal, Charlotte...",Nominated for 1 Golden Globe. Another 2 wins &...,46214579.0,100foot,Lasse Hallström,"Comedy, Drama",55,The Kadam family leaves India for France where...,https://images-na.ssl-images-amazon.com/images...,Walt Disney Pictures,PG,"[{'Value': '7.3/10', 'Source': 'Internet Movie...",122.0,The Hundred-Foot Journey,http://100footjourneymovie.com/,"Steven Knight (screenplay), Richard C. Morais ...",tt2980648,7.3,58025.0,68.0
4,"Glenn Close, Gérard Depardieu, Ioan Gruffudd, ...",Nominated for 1 Oscar. Another 1 win & 4 nomin...,65406212.0,102dalmatians,Kevin Lima,"Adventure, Comedy, Family",35,Cruella DeVil gets out of prison and goes afte...,https://images-na.ssl-images-amazon.com/images...,Buena Vista Pictures,G,"[{'Value': '4.8/10', 'Source': 'Internet Movie...",100.0,102 Dalmatians,http://disney.go.com/DisneyPictures/102dalmatians,"Dodie Smith (novel), Kristen Buckley (story), ...",tt0211181,4.8,27778.0,31.0
5,"Heath Ledger, Julia Stiles, Joseph Gordon-Levi...",2 wins & 12 nominations.,,10thingsihateaboutyou,Gil Junger,"Comedy, Drama, Romance",70,"A pretty, popular teenager can't go out on a d...",https://images-na.ssl-images-amazon.com/images...,Buena Vista Pictures,PG-13,"[{'Value': '7.2/10', 'Source': 'Internet Movie...",97.0,10 Things I Hate About You,,"Karen McCullah, Kirsten Smith",tt0147800,7.2,236553.0,61.0
6,"Charles Bronson, Lisa Eilbacher, Andrew Steven...",,,10tomidnight,J. Lee Thompson,"Crime, Drama, Thriller",,A LAPD detective is on the trail of a very han...,https://images-na.ssl-images-amazon.com/images...,MGM,R,"[{'Value': '6.3/10', 'Source': 'Internet Movie...",101.0,10 to Midnight,,William Roberts,tt0085121,6.3,4627.0,40.0
7,"Jennifer Garner, Mark Ruffalo, Judy Greer, And...",11 nominations.,54600000.0,13goingon30,Gary Winick,"Comedy, Fantasy, Romance",57,A girl makes a wish on her 13th birthday and w...,https://images-na.ssl-images-amazon.com/images...,Sony Pictures,PG-13,"[{'Value': '6.1/10', 'Source': 'Internet Movie...",98.0,13 Going on 30,http://www.sonypictures.com/movies/13goingon30...,"Josh Goldsmith, Cathy Yuspa",tt0337563,6.1,127865.0,64.0
8,"Antonio Banderas, Diane Venora, Dennis Storhøi...",2 wins & 2 nominations.,,13thwarrior,"John McTiernan, Michael Crichton","Action, Adventure, History",42,"A man, having fallen in love with the wrong wo...",https://images-na.ssl-images-amazon.com/images...,Buena Vista Pictures,R,"[{'Value': '6.6/10', 'Source': 'Internet Movie...",102.0,The 13th Warrior,,"Michael Crichton (novel), William Wisher Jr. (...",tt0120657,6.6,103964.0,33.0
9,"John Cusack, Paul Birchard, Margot Leicester, ...",4 wins & 9 nominations.,71912310.0,1408,Mikael Håfström,"Fantasy, Horror",64,A man who specializes in debunking paranormal ...,https://images-na.ssl-images-amazon.com/images...,MGM/Dimension,PG-13,"[{'Value': '6.8/10', 'Source': 'Internet Movie...",104.0,1408,http://www.1408-themovie.com/,"Matt Greenberg (screenplay), Scott Alexander (...",tt0450385,6.8,219621.0,79.0


In [8]:
df.columns

Index(['Actors', 'Awards', 'BoxOffice', 'BoxOfficeID', 'Director', 'Genre',
       'Metascore', 'Plot', 'Poster', 'Production', 'Rated', 'Ratings',
       'Runtime', 'Title', 'Website', 'Writer', 'imdbID', 'imdbRating',
       'imdbVotes', 'RottenTomatoes'],
      dtype='object')

Now that the data is clean, we can input it into the Movies_IMDB table, which we'll later use to populate the many-to-many tables.

In [16]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'Movies_IMDB'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (IMDB_ID varchar(250),
                        BoxOfficeID varchar(250),
                        IMDB_Title varchar(250),
                        IMDB_BoxOffice float,
                        Plot varchar(250),
                        Website varchar(250),
                        Poster varchar(250),
                        Runtime int,
                        IMDB_Rating float,
                        RottenTomatoes_Rating int,
                        Metacritic_Rating int,
                        Director varchar(250),
                        Awards varchar(250),
                        IMDB_Votes int,
                        PRIMARY KEY(IMDB_ID),
                        FOREIGN KEY(BoxOfficeID) REFERENCES Movies.Movies(BoxOfficeID)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

In [17]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'Movies_IMDB'

insert_query_template = '''INSERT IGNORE INTO {db}.{table}(IMDB_ID,
                        BoxOfficeID,
                        IMDB_Title,
                        IMDB_BoxOffice,
                        Plot,
                        Website,
                        Poster,
                        Runtime,
                        IMDB_Rating,
                        RottenTomatoes_Rating,
                        Metacritic_Rating,
                        Director,
                        Awards,
                        IMDB_Votes)
                        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''.format(db=db_name, table=table_name)
for i in range(len(df)):
    query_parameters = (df["imdbID"][i],
                        df["BoxOfficeID"][i],
                        df["Title"][i],
                        df["BoxOffice"][i],
                        df["Plot"][i],
                        df["Website"][i],
                        df["Poster"][i],
                        df["Runtime"][i],
                        df["imdbRating"][i],
                        df["RottenTomatoes"][i],
                        df["Metascore"][i],
                        df["Director"][i],
                        df["Awards"][i],
                        df["imdbVotes"][i])
    cursor.execute(insert_query_template, query_parameters)
con.commit()
cursor.close()
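
One thing to watch for in the insert loop: pandas represents missing numeric values as `NaN`, and how a database driver serializes `NaN` can vary. Here's a sketch of normalizing each value before it goes into `query_parameters`, so that missing numbers are sent as `None` (SQL `NULL`). `to_sql_value` is a hypothetical helper, and the sample row uses placeholder values.

```python
import math

def to_sql_value(value):
    """Return None for float NaN so the driver can send NULL; pass others through."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

row = ("tt0000000", float("nan"), 97.0, None)  # placeholder values
clean_row = tuple(to_sql_value(v) for v in row)
print(clean_row)  # ('tt0000000', None, 97.0, None)
```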

In [18]:
SQLquery_df('''SELECT * FROM Movies.Movies_IMDB''')

Unnamed: 0,Awards,BoxOfficeID,Director,IMDB_BoxOffice,IMDB_ID,IMDB_Rating,IMDB_Title,IMDB_Votes,Metacritic_Rating,Plot,Poster,RottenTomatoes_Rating,Runtime,Website
0,Nominated for 2 Oscars. Another 3 wins & 1 nom...,gold,Charles Chaplin,0.0,tt0015864,8.2,The Gold Rush,72459,,A prospector goes to the Klondike in search of...,https://images-na.ssl-images-amazon.com/images...,100,95,
1,1 win & 3 nominations.,tarzantheapeman,W.S. Van Dyke,0.0,tt0023551,7.2,Tarzan the Ape Man,5739,,A trader and his daughter set off in search of...,https://images-na.ssl-images-amazon.com/images...,100,100,
2,Won 2 Oscars. Another 7 wins & 14 nominations.,wizard,"Victor Fleming, George Cukor, Mervyn LeRoy, No...",3840700.0,tt0032138,8.1,The Wizard of Oz,308351,100.0,Dorothy Gale is swept away from a farm in Kans...,https://images-na.ssl-images-amazon.com/images...,99,102,http://thewizardofoz.warnerbros.com/
3,Won 5 Oscars. Another 2 wins & 7 nominations.,wilson2016,Henry King,0.0,tt0037465,6.9,Wilson,1048,,A chronicle of the political career of US Pres...,https://images-na.ssl-images-amazon.com/images...,88,154,
4,Nominated for 6 Oscars. Another 1 win.,songtosong,Charles Vidor,0.0,tt0038104,6.8,A Song to Remember,982,,Biography of Frederic Chopin.,https://images-na.ssl-images-amazon.com/images...,0,113,
5,,cloakanddagger,Fritz Lang,0.0,tt0038417,6.6,Cloak and Dagger,1782,,"In WW2, the Allies race against time to persua...",https://images-na.ssl-images-amazon.com/images...,75,106,
6,Won 1 Oscar. Another 2 wins & 3 nominations.,razorsedge,Edmund Goulding,0.0,tt0038873,7.5,The Razor's Edge,4207,,An adventuresome young man goes off to find hi...,https://images-na.ssl-images-amazon.com/images...,83,145,
7,,batmanrobin,Spencer Gordon Bennet,0.0,tt0041162,6.3,Batman and Robin,1270,,"The caped crusaders versus The Wizard, black-h...",https://images-na.ssl-images-amazon.com/images...,0,263,
8,,reckless,Max Ophüls,0.0,tt0041786,7.3,The Reckless Moment,2992,,After discovering the dead body of her teenage...,https://images-na.ssl-images-amazon.com/images...,0,82,
9,Won 1 Oscar. Another 3 wins & 10 nominations.,bornyesterday,George Cukor,0.0,tt0042276,7.6,Born Yesterday,7794,,A tycoon hires a tutor to teach his lover prop...,https://images-na.ssl-images-amazon.com/images...,95,103,


Looks good! There were some primary key errors, presumably because of imperfect title matching (two movies in the BoxOfficeMojo data resolving to the same IMDB_ID because of a failed search). Since we used `INSERT IGNORE`, those duplicate rows were simply skipped, and we can tolerate that.
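
If we wanted to see which movies collided, a quick check along these lines would do it. This is only a sketch: `duplicated_ids` is a hypothetical helper and the pairs are made up for illustration. In the notebook, they could come from the `BoxOfficeID` and `imdbID` columns of `df`.

```python
from collections import Counter

def duplicated_ids(pairs):
    """Return {imdb_id: [box_office_ids]} for IMDB IDs that occur more than once."""
    counts = Counter(imdb_id for _, imdb_id in pairs)
    dupes = {}
    for box_office_id, imdb_id in pairs:
        if counts[imdb_id] > 1:
            dupes.setdefault(imdb_id, []).append(box_office_id)
    return dupes

# Made-up pairs for illustration only.
pairs = [("movie_a", "tt0000001"), ("movie_b", "tt0000002"), ("movie_c", "tt0000001")]
print(duplicated_ids(pairs))  # {'tt0000001': ['movie_a', 'movie_c']}
```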

Now for the hard part: the many-to-many tables for **actors** and **genres**. _In the future, we might do this for writers, too, but that data is fairly messy and unstructured, so for now let's stick with these two._

### Step 1: Regular Tables

In [20]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'Actors'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (Actor varchar(250),
                        PRIMARY KEY(Actor)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

In [22]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'Genres'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (Genre varchar(250),
                        PRIMARY KEY(Genre)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

### Step 2: Linking Tables

In [24]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'Actors_Movies'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (Actor varchar(250),
                        IMDB_ID varchar(250),
                        PRIMARY KEY(Actor, IMDB_ID),
                        FOREIGN KEY(IMDB_ID) REFERENCES Movies.Movies_IMDB(IMDB_ID),
                        FOREIGN KEY(Actor) REFERENCES Movies.Actors(Actor)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

In [26]:
cursor = con.cursor()
db_name = 'Movies'
table_name = 'Genres_Movies'
drop_table_query = '''DROP TABLE IF EXISTS {db}.{table}'''.format(db=db_name, table=table_name)
create_table_query = '''CREATE TABLE IF NOT EXISTS {db}.{table}
                        (Genre varchar(250),
                        IMDB_ID varchar(250),
                        PRIMARY KEY(Genre, IMDB_ID),
                        FOREIGN KEY(IMDB_ID) REFERENCES Movies.Movies_IMDB(IMDB_ID),
                        FOREIGN KEY(Genre) REFERENCES Movies.Genres(Genre)
                        )'''.format(db=db_name, table=table_name)
cursor.execute(drop_table_query)
cursor.execute(create_table_query)
cursor.close()

Now for the _really_ hard part: looping through the IMDB IDs to populate these tables.

In [27]:
actors_query = '''INSERT IGNORE INTO Movies.Actors (Actor) 
                    VALUES (%s)'''
actors_movies_query ='''INSERT IGNORE INTO Movies.Actors_Movies (Actor, IMDB_ID) 
                        VALUES (%s, %s)'''

genres_query = '''INSERT IGNORE INTO Movies.Genres (Genre) 
                    VALUES (%s)'''
genres_movies_query= '''INSERT IGNORE INTO Movies.Genres_Movies (Genre, IMDB_ID) 
                        VALUES (%s, %s)'''

cursor = con.cursor()

for i in range(len(df)):
    imdb_id = df["imdbID"][i]
    actors = df['Actors'][i].split(',')
    genres = df['Genre'][i].split(',')
    for actor in actors:
        actor = actor.strip(' ')
        actors_parameters = tuple([actor])
        actors_movies_parameters = (actor, imdb_id)
        cursor.execute(actors_query, actors_parameters)
        cursor.execute(actors_movies_query, actors_movies_parameters)
    for genre in genres:
        genre = genre.strip(' ')
        genres_parameters = tuple([genre])
        genres_movies_parameters = (genre, imdb_id)
        cursor.execute(genres_query, genres_parameters)
        cursor.execute(genres_movies_query, genres_movies_parameters)
con.commit()
cursor.close()
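
The pattern in that loop relies on the primary keys to deduplicate: inserting a genre or actor that already exists is silently skipped. Here's a self-contained sketch of the same idea, using an in-memory SQLite database as a stand-in for MySQL. That is an assumption made for portability: SQLite spells the statement `INSERT OR IGNORE`, and the rows are placeholders.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
    CREATE TABLE Genres (Genre TEXT PRIMARY KEY);
    CREATE TABLE Genres_Movies (
        Genre TEXT REFERENCES Genres(Genre),
        IMDB_ID TEXT,
        PRIMARY KEY (Genre, IMDB_ID)
    );
""")

# Placeholder rows: (IMDB_ID, comma-separated genres).
movies = [("tt0000001", "Action, Horror, Sci-Fi"), ("tt0000002", "Action, Drama")]

for imdb_id, genre_field in movies:
    for genre in (g.strip() for g in genre_field.split(",")):
        # The second 'Action' is ignored by the primary key on Genres.
        con_demo.execute("INSERT OR IGNORE INTO Genres (Genre) VALUES (?)", (genre,))
        con_demo.execute(
            "INSERT OR IGNORE INTO Genres_Movies (Genre, IMDB_ID) VALUES (?, ?)",
            (genre, imdb_id),
        )
con_demo.commit()

genre_count = con_demo.execute("SELECT COUNT(*) FROM Genres").fetchone()[0]
link_count = con_demo.execute("SELECT COUNT(*) FROM Genres_Movies").fetchone()[0]
print(genre_count, link_count)  # 4 5
```

Four distinct genres end up in the regular table, while the linking table keeps all five genre-movie pairs.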

In [31]:
SQLquery_df('''SELECT A.Actor, A.IMDB_ID, M.IMDB_Title
                FROM Movies.Actors_Movies A INNER JOIN Movies.Movies_IMDB M ON A.IMDB_ID = M.IMDB_ID
                WHERE A.Actor = \'Christian Bale\' ''')

Unnamed: 0,Actor,IMDB_ID,IMDB_Title
0,Christian Bale,tt0092965,Empire of the Sun
1,Christian Bale,tt0162650,Shaft
2,Christian Bale,tt0238112,Captain Corelli's Mandolin
3,Christian Bale,tt0253556,Reign of Fire
4,Christian Bale,tt0372784,Batman Begins
5,Christian Bale,tt0438488,Terminator Salvation
6,Christian Bale,tt0468569,The Dark Knight
7,Christian Bale,tt0482571,The Prestige
8,Christian Bale,tt0964517,The Fighter
9,Christian Bale,tt1152836,Public Enemies


There you have it! Our database is complete.