<h2><u> Situation </u></h2>

I now have a clean dataset with all entities resolved. However, the metadata is quite sparse: all I have is the source website, the rank it gave the movie (-1 if no ranking was provided) and the title of the movie. Some preliminary data analysis follows below, but first I'm going to augment the data with additional data sources so that I can answer some more interesting questions.

In [3]:
import pandas as pd
from collections import Counter
#!pip install git+https://github.com/alberanid/imdbpy
#!pip install imdbpy
from imdb import IMDb
data = pd.read_csv("final.csv", index_col = 'Index')
pd.options.display.max_rows = 1000
ia = IMDb()

<h2><u> Task 1 </u></h2>

Let's augment the data with as many cool data sources as possible. The first step is to link every movie title I collected to its IMDb ID. I used the IMDbPY package to do this; the docs are available at: https://readthedocs.org/projects/imdbpy/downloads/pdf/latest/



In [None]:
'''
#map each title to (IMDb ID, year); the values are filled in interactively below
imdb_dict = dict((title, (None, None)) for title in data.Title)
for movie in imdb_dict:
    imdb_movies = ia.search_movie(movie)
    for candidate in imdb_movies:
        #some search results have no year, so use .get() to avoid a KeyError
        year = candidate.get('year')
        print(movie + " " + str(candidate) + " " + str(year))
        choice = input("Correct title?: ")
        if choice == 'y':
            imdb_dict[movie] = (candidate.movieID, year)
            break

#DataFrame.append was removed in pandas 2.0, so build the frame directly
imdb_list = [[movie, imdb_id, year] for movie, (imdb_id, year) in imdb_dict.items()]
imdb_df = pd.DataFrame(imdb_list, columns = ['My_Title', 'IMDB_ID', 'Year'])
imdb_df.head()
imdb_df.to_csv('imdb.csv')
'''

This was a nice exercise in fact-checking my data. I found a few inconsistencies where ampersands had been dropped from titles, e.g. stan ollie -> stan & ollie. Now, with the IMDb IDs, I can pull out a wealth of information!

In [15]:
ID = 1392190
movie = ia.get_movie(ID)
movie.infoset2keys
#not every feature is listed in infoset2keys for every movie!

{'main': ['original title',
  'cast',
  'genres',
  'runtimes',
  'countries',
  'country codes',
  'language codes',
  'color info',
  'aspect ratio',
  'sound mix',
  'box office',
  'certificates',
  'original air date',
  'rating',
  'votes',
  'cover url',
  'plot outline',
  'languages',
  'title',
  'year',
  'kind',
  'directors',
  'writers',
  'producers',
  'composers',
  'cinematographers',
  'editors',
  'editorial department',
  'casting directors',
  'production designers',
  'art directors',
  'set decorators',
  'costume designers',
  'make up department',
  'production managers',
  'assistant directors',
  'art department',
  'sound department',
  'special effects',
  'visual effects',
  'stunts',
  'camera department',
  'animation department',
  'casting department',
  'costume departmen',
  'location management',
  'music department',
  'script department',
  'transportation department',
  'miscellaneous',
  'thanks',
  'akas',
  'writer',
  'director',
  'top 250 

The features that I found interesting are collected in this features list:

In [16]:
features = ['title',
            'cast', 
            'genres',
            'runtimes',
            'box office',
            'rating',
            'votes',
            'kind',
            'directors',
            'writers',
            'producers',
            'composers',
            'cinematographers',
            'editors',
            'casting directors',
            'top 250 rank',
            'plot']

In [None]:
#reading the list produced in first block
imdb_df = pd.read_csv('imdb.csv')

The way the API works is that you send it an IMDb ID and it returns the set of keys it has available for that movie. Not all movies have all the features, so I need to check whether each feature exists in the set of keys before reading it.

In [19]:
for ID in [1392190]:
    movie = ia.get_movie(ID)
    for feature in features:
        if (feature in movie.infoset2keys['main']) or (feature in movie.infoset2keys['plot']):
            print(feature)
            print(movie[feature])
            print()
        else:
            print(feature + " not available\n")

title
Mad Max: Fury Road

cast
[<Person id:0362766[http] name:_Tom Hardy_>, <Person id:0000234[http] name:_Charlize Theron_>, <Person id:0396558[http] name:_Nicholas Hoult_>, <Person id:0117412[http] name:_Hugh Keays-Byrne_>, <Person id:2890541[http] name:_Josh Helman_>, <Person id:0428923[http] name:_Nathan Jones_>, <Person id:2368789[http] name:_Zoë Kravitz_>, <Person id:2492819[http] name:_Rosie Huntington-Whiteley_>, <Person id:2142336[http] name:_Riley Keough_>, <Person id:3880181[http] name:_Abbey Lee_>, <Person id:5196907[http] name:_Courtney Eaton_>, <Person id:0397398[http] name:_John Howard_>, <Person id:0141885[http] name:_Richard Carter_>, <Person id:5208473[http] name:_Iota_>, <Person id:0760151[http] name:_Angus Sampson_>, <Person id:0353228[http] name:_Jennifer Hagan_>, <Person id:0301885[http] name:_Megan Gale_>, <Person id:0415513[http] name:_Melissa Jaffer_>, <Person id:0432970[http] name:_Melita Jurisic_>, <Person id:0428143[http] name:_Gillian Jones_>, <Person id:08
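
The availability check in the loop above could also be factored into a small helper. This is only a sketch: `split_features` is a hypothetical name, and it assumes `infoset2keys` behaves like a plain dict mapping each info set to a list of keys, as in the output shown earlier. Unlike the loop, it looks in every info set rather than only 'main' and 'plot'. The example uses a hand-written stand-in dict so that it runs without calling the API.

```python
def split_features(infoset2keys, wanted):
    """Split `wanted` into (available, missing) feature lists.

    `infoset2keys` is assumed to map info set names to lists of keys,
    like movie.infoset2keys in the notebook.
    """
    available_keys = set()
    for keys in infoset2keys.values():
        available_keys.update(keys)
    available = [f for f in wanted if f in available_keys]
    missing = [f for f in wanted if f not in available_keys]
    return available, missing


# hand-written stand-in for movie.infoset2keys (not real API output)
example_keys = {'main': ['title', 'cast', 'genres'], 'plot': ['plot', 'synopsis']}
available, missing = split_features(example_keys, ['title', 'plot', 'box office'])
print(available)
print(missing)
```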

Directors, actors etc. are stored as lists of Person objects. Here's how to unpack one. On examining this feature, I realized that the only really interesting information is the person's name, so I'll change the cast field to hold only actor names.

In [20]:
movie['cast'][0].personID, movie['cast'][0]['name']

('0362766', 'Tom Hardy')
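
A minimal sketch of reducing a list of Person objects to plain names. `person_names` and `FakePerson` are hypothetical names introduced here; the sketch assumes only that each person supports `person['name']` and raises a KeyError when the name is missing. The stand-in class lets the example run without fetching anything from IMDb.

```python
def person_names(people, limit=None):
    """Return the names of the first `limit` people (all of them if limit is None)."""
    names = []
    for person in people[:limit]:
        try:
            names.append(person['name'])
        except KeyError:
            # skip entries that carry no name
            continue
    return names


class FakePerson(dict):
    """Hypothetical stand-in for a Person object, used only for this example."""


cast = [
    FakePerson(name='Tom Hardy'),
    FakePerson(name='Charlize Theron'),
    FakePerson(),  # an entry without a name
    FakePerson(name='Nicholas Hoult'),
]
top_two = person_names(cast, limit=2)
everyone = person_names(cast)
print(top_two)
print(everyone)
```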

The box office value is not consistent across movies, so I'm not going to use it. I can try scraping this value from Box Office Mojo later.

https://github.com/situkun123/Moive_mojo_project/blob/master/Moive_mojo_project.ipynb

For now, I'm focusing on adding the features I have. I'm going to use a nested dictionary to do this. Occasionally the API fails to fetch data, so I use a try-except block and record the failing IDs. Any that fail can be added to the dictionary manually before I pickle it.

In [24]:
#to find
#imdb_dict_copy  = imdb_dict

In [40]:
def create_imdb_dict():
    imdb_dict = dict()
    err = []
    for i in range(len(imdb_df)):
        print("Collecting information for: " + imdb_df.loc[i].My_Title)
        print(str(i) + "/" + str(len(imdb_df)))
        ID = imdb_df.loc[i].IMDB_ID
        imdb_dict[ID] = dict()
        try:
            #the fetch itself can fail, so it belongs inside a try block too
            movie = ia.get_movie(ID)
        except Exception:
            err.append(ID)
            continue
        for feature in features:
            if (feature in movie.infoset2keys['main']) or (feature in movie.infoset2keys['plot']):
                try:
                    imdb_dict[ID][feature] = movie[feature]
                except Exception:
                    #record the ID so the movie can be fixed manually later
                    err.append(ID)
    return imdb_dict, err
#imdb_dict, err = create_imdb_dict()

Collecting information for: melancholia
0/427
Collecting information for: mad max: fury road
1/427
Collecting information for: the tree of life
2/427
Collecting information for: the rider
3/427
Collecting information for: a separation
4/427
Collecting information for: moonlight
5/427
Collecting information for: the fits
6/427
Collecting information for: margaret
7/427
Collecting information for: spider-man: into the spider-verse
8/427
Collecting information for: the florida project
9/427
Collecting information for: actress
10/427
Collecting information for: its such a beautiful day
11/427
Collecting information for: hell or high water
12/427
Collecting information for: parasite
13/427
Collecting information for: under the skin
14/427
Collecting information for: the handmaiden
15/427
Collecting information for: cameraperson
16/427
Collecting information for: once upon a time in hollywood
17/427
Collecting information for: clouds of sils maria
18/427
Collecting information for: first ref

Collecting information for: kill list
167/427
Collecting information for: joker
168/427
Collecting information for: avengers: infinity war
169/427
Collecting information for: blue valentine
170/427
Collecting information for: blue is the warmest colour
171/427
Collecting information for: it follows
172/427
Collecting information for: star wars: the force awakens
173/427
Collecting information for: the raid
174/427
Collecting information for: son of saul
175/427
Collecting information for: miss bala
176/427
Collecting information for: the immigrant
177/427
Collecting information for: the comedy
178/427
Collecting information for: the arbor
179/427
Collecting information for: high life
180/427
Collecting information for: drug war
181/427
Collecting information for: weekend
182/427
Collecting information for: happy hour
183/427
Collecting information for: mudbound
184/427
Collecting information for: stranger by the lake
185/427
Collecting information for: computer chess
186/427
Collecting

Collecting information for: love & friendship
324/427
Collecting information for: lucky
325/427
Collecting information for: maiden
326/427
Collecting information for: the martian
327/427
Collecting information for: the avengers
328/427
Collecting information for: mcqueen
329/427
Collecting information for: mission: impossible - rogue nation
330/427
Collecting information for: mud
331/427
Collecting information for: my life as a zucchini
332/427
Collecting information for: the nice guys
333/427
Collecting information for: the old man & the gun
334/427
Collecting information for: paddington
335/427
Collecting information for: pain and glory
336/427
Collecting information for: the peanut butter falcon
337/427
Collecting information for: the post
338/427
Collecting information for: rocketman
339/427
Collecting information for: the salesman
340/427
Collecting information for: searching
341/427
Collecting information for: shaun the sheep movie
342/427
Collecting information for: shazam
343/4

This process takes a fairly long time so I'm pickling the dictionary in case the notebook crashes.

In [42]:
import pickle
#use a context manager so the file is closed even if pickling fails
with open("imdb.p", "wb") as f:
    pickle.dump(imdb_dict, f)
#with open("imdb.p", "rb") as f:
#    imdb_dict = pickle.load(f)
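
One way to make the slow collection step survive a notebook restart is a load-or-build pattern: reuse the pickle if it exists, otherwise run the collection and save the result. This is a sketch under assumptions: `load_or_build` and `fake_build` are hypothetical names, and the example writes to a temporary directory rather than to imdb.p.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Load a pickled object from `path`, or call `build()` and pickle its result."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


calls = []

def fake_build():
    # stands in for the slow create_imdb_dict() call
    calls.append(1)
    return {1392190: {'title': 'Mad Max: Fury Road'}}


demo_path = os.path.join(tempfile.mkdtemp(), 'imdb_demo.p')
first = load_or_build(demo_path, fake_build)   # builds and saves
second = load_or_build(demo_path, fake_build)  # loads from disk
print(len(calls))
```

In the notebook, `build` would be a function that returns only the dictionary, since `create_imdb_dict` returns a tuple with the error list as well.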

Above you can see what one item in this dictionary looks like. Let's now engineer some better features before we add everything to one large dataframe.

1. For actors, directors etc., I want just a list of names (not Person objects), perhaps limited to the top 10 billed actors in each movie.

2. For the plot, I want to use NLP techniques to extract important keywords, which I can then one-hot encode.

3. One-hot encode all list features.

4. Reduce runtimes from a list to a single value.
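
Steps 3 and 4 could be sketched roughly as follows. The sketch assumes that list features are stored as Python lists and that runtimes is a list of numeric strings such as ['120']. `first_runtime` and `one_hot_lists` are hypothetical helper names, and the toy frame uses made-up film labels and values rather than real data. Keyword extraction for the plot (step 2) is not covered here.

```python
import pandas as pd


def first_runtime(runtimes):
    """Return the first runtime as an int, or None if it is missing or malformed."""
    if not runtimes:
        return None
    try:
        return int(runtimes[0])
    except (TypeError, ValueError):
        return None


def one_hot_lists(series, prefix):
    """One-hot encode a Series whose values are lists of labels."""
    # one row per label; empty lists become NaN and are dropped
    exploded = series.explode().dropna()
    dummies = pd.get_dummies(exploded, prefix=prefix, dtype=int)
    # collapse back to one row per original index, restoring rows with no labels
    return dummies.groupby(level=0).max().reindex(series.index, fill_value=0)


# toy data with hypothetical film labels
toy = pd.DataFrame(
    {
        'genres': [['Action', 'Sci-Fi'], ['Drama'], []],
        'runtimes': [['120'], ['99', '105'], None],
    },
    index=['film_a', 'film_b', 'film_c'],
)
encoded = one_hot_lists(toy['genres'], prefix='genres')
runtime_values = [first_runtime(r) for r in toy['runtimes']]
print(encoded)
print(runtime_values)
```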

The box office feature had inconsistent values as seen below. Some have the budget and gross, some just the opening weekend.


In [47]:
for key in imdb_dict.keys():
    if 'box office' in imdb_dict[key]:
        print(imdb_dict[key]['title'])
        print(imdb_dict[key]['box office'])
    else:
        break

Melancholia
{'Budget': '$7,400,000 (estimated)', 'Opening Weekend Denmark': 'DKK958,848, 29 May 2011'}
Mad Max: Fury Road
{'Budget': '$150,000,000 (estimated)', 'Cumulative Worldwide Gross': '$378,436,354'}
The Tree of Life
{'Budget': '$32,000,000 (estimated)', 'Opening Weekend United States': '$493,788, 30 May 2011', 'Cumulative Worldwide Gross': '$54,303,319, 27 Oct 2011'}
The Rider
{'Opening Weekend United States': '$42,244, 15 Apr 2018'}
A Separation
{'Budget': '$500,000 (estimated)', 'Opening Weekend Iran': '$100,000, 19 Mar 2011', 'Cumulative Worldwide Gross': '$24,426,169'}
Moonlight
{'Budget': '$1,500,000 (estimated)', 'Opening Weekend United States': '$1,488,740, 18 Nov 2016', 'Cumulative Worldwide Gross': '$55,561,162, 20 Mar 2017'}
The Fits
{'Opening Weekend United States': '$11,300, 05 Jun 2016'}
Margaret
{'Budget': '$14,000,000 (estimated)', 'Opening Weekend United States': '$7,525, 02 Oct 2011'}
Spider-Man: Into the Spider-Verse
{'Budget': '$90,000,000 (estimated)', 'Open
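
If the box office field is revisited later, the strings would first need parsing. A rough sketch, assuming every value starts with an optional currency prefix followed by a comma-grouped number, as in the output above: `parse_amount` and `summarize_box_office` are hypothetical helpers, and trailing dates or "(estimated)" notes are simply ignored.

```python
import re


def parse_amount(text):
    """Return (currency_prefix, amount) from a string like '$7,400,000 (estimated)'."""
    match = re.match(r'\s*([^\d\s,]*)\s*([\d,]+)', text)
    if not match:
        return None
    currency, digits = match.groups()
    digits = digits.replace(',', '')
    if not digits:
        return None
    return currency, int(digits)


def summarize_box_office(box_office):
    """Parse every field of a box office dict, keeping the field names."""
    return {field: parse_amount(value) for field, value in box_office.items()}


# the Melancholia entry from the output above
melancholia = {
    'Budget': '$7,400,000 (estimated)',
    'Opening Weekend Denmark': 'DKK958,848, 29 May 2011',
}
parsed = summarize_box_office(melancholia)
print(parsed)
```

Even after parsing, the amounts are in different currencies and the set of fields differs per movie, so the inconsistency noted above still applies.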

<h2><u> Task 2 </u></h2>
Preliminary data analysis

In [48]:
data.head()

Index,Title,Website,Rank
0,melancholia,vulture,1
1,mad max: fury road,vulture,2
2,the tree of life,vulture,3
3,the rider,vulture,4
4,a separation,vulture,5


<b> Question 1: </b> What were the top 10 movies mentioned across all the sources?

<b>Answer:</b>

In [49]:
#double square brackets keep the result as a DataFrame rather than a Series
#nlargest on a DataFrame requires the columns argument
#select the column before calling count() on the groupby
data.groupby('Title')[['Website']].count().nlargest(columns = 'Website', n = 10)

Title,Website
mad max: fury road,29
moonlight,28
get out,26
the social network,22
inside llewyn davis,19
boyhood,18
lady bird,18
roma,16
call me by your name,15
parasite,15


Okay, but some of the sources were ranked and some unranked, so let's see what lists emerge from each group. To split them up I use the split-apply-combine pattern of pandas.

<b> Question 2: </b> What were the top 10 movies mentioned across ranked and unranked sources?

<b>Answer:</b>

In [53]:
#flag each row: 1 if the source gave a rank, 0 if it was unranked (Rank == -1)
data['is_ranked'] = (data['Rank'] != -1).astype(int)
ranked_data = data[data['is_ranked'] == 1]
unranked_data = data[data['is_ranked'] == 0]
unranked_data.groupby('Title')[['Website']].count().nlargest(columns = 'Website', n = 10)

Title,Website
moonlight,9
get out,8
the social network,8
boyhood,7
lady bird,7
mad max: fury road,7
roma,7
black panther,6
her,5
inception,5


In [54]:
ranked_data.groupby('Title')[['Website']].count().nlargest(columns = 'Website', n = 10)

Title,Website
mad max: fury road,22
moonlight,19
get out,18
inside llewyn davis,14
the social network,14
the master,13
the act of killing,12
the grand budapest hotel,12
boyhood,11
call me by your name,11
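
The two lists above can also be produced in a single split-apply-combine pass by grouping on the flag and the title together. A small sketch on made-up toy data: the site names and ranks are hypothetical, and the column names mirror the ones used in this notebook.

```python
import pandas as pd

# toy data: hypothetical sites and ranks, -1 meaning "unranked"
toy = pd.DataFrame({
    'Title': ['moonlight', 'moonlight', 'moonlight',
              'get out', 'get out', 'get out',
              'roma', 'roma', 'roma'],
    'Website': ['site_a', 'site_b', 'site_c',
                'site_a', 'site_b', 'site_c',
                'site_a', 'site_b', 'site_c'],
    'Rank': [1, 3, -1, -1, -1, 2, -1, -1, -1],
})
toy['is_ranked'] = (toy['Rank'] != -1).astype(int)

# split on (is_ranked, Title), apply a count, combine into one frame
mentions = (
    toy.groupby(['is_ranked', 'Title'])
       .size()
       .rename('Mentions')
       .reset_index()
)

# keep the two most-mentioned titles within each group
top_per_group = (
    mentions.sort_values(['is_ranked', 'Mentions'], ascending=[True, False])
            .groupby('is_ranked')
            .head(2)
)
print(top_per_group)
```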


<b> Question 3: </b> Who were the top actors, directors, writers etc.?

<b>Answer:</b>

In [153]:
#count how often each person appears under person_key across all movies
#a non-zero top_billing restricts the count to the first top_billing people per movie
def top_n_list(dic, person_key, n, top_billing = 0):
    people = []
    for key in dic.keys():
        if person_key not in dic[key]:
            continue
        credits = dic[key][person_key]
        if top_billing != 0:
            credits = credits[0 : top_billing]
        for person in credits:
            try:
                people.append((person.personID, person['name']))
            except Exception:
                #skip entries that cannot be unpacked
                continue
    count_people = Counter(people)
    return count_people.most_common()[0:n]

In [154]:
x = top_n_list(imdb_dict, 'cast', 10)
x

[(('5241466', 'Mark Falvo'), 22),
 (('7419291', 'Arnold Montey'), 14),
 (('0171625', 'Bern Collaço'), 14),
 (('0498278', 'Stan Lee'), 13),
 (('3485845', 'Adam Driver'), 11),
 (('0424060', 'Scarlett Johansson'), 10),
 (('5857646', 'Joseph Oliveira'), 10),
 (('0842770', 'Tilda Swinton'), 10),
 (('6768665', 'Patti Schellhaas'), 10),
 (('0000168', 'Samuel L. Jackson'), 10)]

The top 4 entries are all extras or cameo appearances! Luckily, the cast list is sorted by billing order, so I can use the top_billing parameter of my function to count only the top 20 billed actors from each movie, which I think is a fair way to find the top actors.

<h3> Top actors of the decade</h3>

In [155]:
x = top_n_list(imdb_dict, 'cast', n = 10, top_billing = 20)
x

[(('0424060', 'Scarlett Johansson'), 10),
 (('3485845', 'Adam Driver'), 10),
 (('1209966', 'Oscar Isaac'), 8),
 (('0000168', 'Samuel L. Jackson'), 8),
 (('0842770', 'Tilda Swinton'), 8),
 (('0749263', 'Mark Ruffalo'), 7),
 (('1727304', 'Domhnall Gleeson'), 7),
 (('0331516', 'Ryan Gosling'), 7),
 (('1256532', 'Jon Bernthal'), 7),
 (('0262635', 'Chris Evans'), 7)]

There we go! The critic favorites of the decade!

<h3> Top directors of the decade</h3>

In [157]:
x = top_n_list(imdb_dict, 'directors', n = 10)
x

[(('0898288', 'Denis Villeneuve'), 5),
 (('0634240', 'Christopher Nolan'), 4),
 (('0169806', 'Taika Waititi'), 4),
 (('0751577', 'Anthony Russo'), 4),
 (('0751648', 'Joe Russo'), 4),
 (('0000233', 'Quentin Tarantino'), 3),
 (('0000759', 'Paul Thomas Anderson'), 3),
 (('0000229', 'Steven Spielberg'), 3),
 (('0487166', 'Yorgos Lanthimos'), 3),
 (('1509478', 'Benny Safdie'), 3)]

<h3> Top producers of the decade</h3>

In [158]:
x = top_n_list(imdb_dict, 'producers', n = 10)
x

[(('0748784', 'Scott Rudin'), 16),
 (('0498278', 'Stan Lee'), 16),
 (('5398118', 'Olivier Père'), 16),
 (('2691892', 'Megan Ellison'), 13),
 (('0000093', 'Brad Pitt'), 12),
 (('0022285', 'Victoria Alonso'), 12),
 (('0195669', "Louis D'Esposito"), 12),
 (('0270559', 'Kevin Feige'), 12),
 (('0743882', 'Tessa Ross'), 11),
 (('4791912', 'Eli Bush'), 11)]

<h3> Top composers of the decade</h3>

In [159]:
x = top_n_list(imdb_dict, 'composers', n = 10)
x

[(('0006035', 'Alexandre Desplat'), 10),
 (('0315974', 'Michael Giacchino'), 10),
 (('0001877', 'Hans Zimmer'), 8),
 (('0002354', 'John Williams'), 5),
 (('0001937', 'Marco Beltrami'), 5),
 (('0339351', 'Jonny Greenwood'), 4),
 (('0002353', 'Thomas Newman'), 4),
 (('0001980', 'Carter Burwell'), 4),
 (('2273444', 'Henry Jackman'), 4),
 (('0510533', 'Giong Lim'), 4)]

<h3> Top cinematographers of the decade</h3>

In [160]:
x = top_n_list(imdb_dict, 'cinematographers', n = 10)
x

[(('0005683', 'Roger Deakins'), 6),
 (('0523881', 'Emmanuel Lubezki'), 4),
 (('0393240', 'Kyung-pyo Hong'), 4),
 (('0494617', 'Yorick Le Saux'), 4),
 (('0568174', 'Michael McDonough'), 4),
 (('0887227', 'Hoyte Van Hoytema'), 4),
 (('0451787', 'Darius Khondji'), 4),
 (('1831620', 'Sean Porter'), 4),
 (('0006509', 'Rodrigo Prieto'), 4),
 (('1023204', 'Ben Davis'), 4)]

<h3> Top editors of the decade</h3>

In [161]:
x = top_n_list(imdb_dict, 'editors', n = 10)
x

[(('0907863', 'Joe Walker'), 6),
 (('0328557', 'Affonso Gonçalves'), 5),
 (('0809059', 'Lee Smith'), 5),
 (('0285701', 'Jeffrey Ford'), 5),
 (('0711235', 'Fred Raskin'), 4),
 (('0561430', 'Yorgos Mavropsaridis'), 4),
 (('0918733', 'Andrew Weisblum'), 4),
 (('2352780', 'Jennifer Lame'), 4),
 (('0773113', 'Matthew Schmidt'), 4),
 (('1477623', 'Nat Sanders'), 3)]

<h3> Top writers of the decade</h3>

In [162]:
x = top_n_list(imdb_dict, 'writers', n = 10)
x

[(('0456158', 'Jack Kirby'), 13),
 (('0498278', 'Stan Lee'), 11),
 (('0094435', 'Bong Joon Ho'), 5),
 (('0634240', 'Christopher Nolan'), 5),
 (('1921680', 'Steve Englehart'), 5),
 (('0004056', 'Andrew Stanton'), 5),
 (('0800209', 'Joe Simon'), 5),
 (('0027572', 'Wes Anderson'), 5),
 (('0520488', 'Phil Lord'), 4),
 (('4401003', 'Derek Kolstad'), 4)]

<h3> Top genres of the decade</h3>

In [138]:
imdb_dict[1392190]['genres']

['Action', 'Adventure', 'Sci-Fi', 'Thriller']

In [164]:
genres = []
for key in imdb_dict.keys():
    #not every movie has a genres entry, so default to an empty list
    for genre in imdb_dict[key].get('genres', []):
        genres.append(genre)
genre_counts = Counter(genres)
genre_counts.most_common()[0:10]

[('Drama', 299),
 ('Comedy', 98),
 ('Thriller', 93),
 ('Adventure', 63),
 ('Crime', 63),
 ('Action', 60),
 ('Documentary', 59),
 ('Romance', 58),
 ('Sci-Fi', 54),
 ('Biography', 49)]

Dramas just dominate these lists. I'm quite surprised that comedy was the second most common genre, given how badly pure comedy movies tend to get rated on IMDb and Rotten Tomatoes.