<h2><u> Situation </u></h2>

I now have a clean dataset with all entities resolved, but the metadata is sparse: all I have is the source website, the rank it assigned (-1 if the source gave no ranking) and the title of the movie. Some preliminary data analysis follows here; after that I augment the data with alternative data sources so that I can answer more interesting questions.

<h2><u> Task 1 </u></h2>
Preliminary data analysis

In [1]:
import pandas as pd
data = pd.read_csv("final.csv", index_col = 'Index')
pd.options.display.max_rows = 1000

In [2]:
data.head()

Index,Title,Website,Rank
0,melancholia,vulture,1
1,mad max: fury road,vulture,2
2,the tree of life,vulture,3
3,the rider,vulture,4
4,a separation,vulture,5


<b> Question 1: </b> What were the top 10 movies mentioned across all the sources?

<b>Answer:</b>

In [3]:
#double square brackets keep the result as a DataFrame
#DataFrame.nlargest requires the columns argument
#select the column after groupby and before the count aggregation
data.groupby('Title')[['Website']].count().nlargest(columns = 'Website', n = 10)

Title,Website
mad max: fury road,29
moonlight,28
get out,26
the social network,22
inside llewyn davis,19
boyhood,18
lady bird,18
roma,16
call me by your name,15
parasite,15


Some of the sources ranked their picks and some did not, so let's see which lists emerge from the ranked sources and which from the unranked ones. To split them up I flag each row as ranked or unranked and count titles within each subset, following the split-apply-combine pattern of pandas.
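As a minimal sketch of the split-apply-combine idea, the ranked/unranked split and the per-title count can also be done in a single `groupby`. The `toy` frame below is an illustrative stand-in for `data` (its values are made up), and the `kind` label is a name introduced only for this example.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are illustrative only).
toy = pd.DataFrame({
    "Title": ["a", "a", "b", "a", "b", "c"],
    "Website": ["s1", "s2", "s1", "s3", "s2", "s3"],
    "Rank": [1, -1, 2, 5, -1, -1],
})

# Split: label each row by whether its source gave a ranking (-1 means none).
kind = toy["Rank"].ne(-1).map({True: "ranked", False: "unranked"}).rename("kind")

# Apply + combine: count mentions per (kind, Title) pair in one pass.
mentions = toy.groupby([kind, "Title"])["Website"].count()

# Top titles within each kind.
ranked_top = mentions.loc["ranked"].nlargest(2)
unranked_top = mentions.loc["unranked"].nlargest(2)
print(ranked_top)
print(unranked_top)
```

This gives the same counts as filtering into two frames first; which form reads better is largely a matter of taste.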

<b> Question 2: </b> What were the top 10 movies mentioned across ranked and unranked sources?

<b>Answer:</b>

In [4]:
def is_rank(x):
    #-1 marks a source that gave no ranking
    return 0 if x == -1 else 1

data['is_ranked'] = data['Rank'].apply(is_rank)

In [5]:
ranked_data = data[data['is_ranked'] == 1]
unranked_data = data[data['is_ranked'] == 0]

In [6]:
unranked_data.groupby('Title')[['Website']].count().nlargest(columns = 'Website', n = 10)

Title,Website
moonlight,9
get out,8
the social network,8
boyhood,7
lady bird,7
mad max: fury road,7
roma,7
black panther,6
her,5
inception,5


In [7]:
ranked_data.groupby('Title')[['Website']].count().nlargest(columns = 'Website', n = 10)

Title,Website
mad max: fury road,22
moonlight,19
get out,18
inside llewyn davis,14
the social network,14
the master,13
the act of killing,12
the grand budapest hotel,12
boyhood,11
call me by your name,11


<h2><u> Task 2 </u></h2>

Let's augment the data with as many cool data sources as possible. First I use IMDb data, accessed through the IMDbPY package. The docs are available at: https://readthedocs.org/projects/imdbpy/downloads/pdf/latest/

Lincoln -> 

In [12]:
#!pip install git+https://github.com/alberanid/imdbpy
#!pip install imdbpy
from imdb import IMDb
ia = IMDb()
#map each unique title to (IMDb ID, year); filled in after manual confirmation
imdb_dict = dict((title, (None, None)) for title in data.Title)
for movie in imdb_dict:
    imdb_movies = ia.search_movie(movie)
    for candidate in imdb_movies:
        #some search results have no year, so use .get to avoid a KeyError
        year = candidate.get('year')
        print(movie + " " + str(candidate) + " " + str(year))
        choice = input("Correct title?: ")
        if choice == 'y':
            imdb_dict[movie] = (candidate.movieID, year)
            break

melancholia Melancholia 2011
Correct title?: y
mad max: fury road Mad Max: Fury Road 2015
Correct title?: y
the tree of life The Tree of Life 2011
Correct title?: y
the rider The Rider 2017
Correct title?: y
a separation A Separation 2011
Correct title?: y
moonlight Moonlight 2016
Correct title?: y
the fits The Fits 2015
Correct title?: y
margaret Margaret 2011
Correct title?: y
spider-man: into the spider-verse Spider-Man: Into the Spider-Verse 2018
Correct title?: y
the florida project The Florida Project 2017
Correct title?: y
actress Actress 2014
Correct title?: y
its such a beautiful day It's Such a Beautiful Day 2012
Correct title?: y
hell or high water Hell or High Water 2016
Correct title?: y
parasite Parasite 2019
Correct title?: y
under the skin Under the Skin 2013
Correct title?: y
the handmaiden The Handmaiden 2016
Correct title?: y
cameraperson Cameraperson 2016
Correct title?: y
once upon a time in hollywood Once Upon a Time... in Hollywood 2019
Correct title?: y
clouds o

monsters Monsters 2010
Correct title?: y
beasts of the southern wild Beasts of the Southern Wild 2012
Correct title?: y
cold war Cold War 2018
Correct title?: y
thor: ragnarok Thor: Ragnarok 2017
Correct title?: y
uncle boonmee who can recall his past lives Uncle Boonmee Who Can Recall His Past Lives 2010
Correct title?: y
the artist The Artist 2011
Correct title?: y
the lobster The Lobster 2015
Correct title?: y
the lost city of z The Lost City of Z 2016
Correct title?: y
captain america: the winter soldier Captain America: The Winter Soldier 2014
Correct title?: y
shame Shame 2011
Correct title?: y
four lions Four Lions 2010
Correct title?: y
the dark knight rises The Dark Knight Rises 2012
Correct title?: y
looper Looper 2012
Correct title?: y
nocturama Nocturama 2016
Correct title?: y
columbus 1492: Conquest of Paradise 1992
Correct title?: n
columbus Columbus 2017
Correct title?: y
american honey American Honey 2016
Correct title?: y
carlos Carlos 2010
Correct title?: y
kill list 

blackfish Blackfish 2013
Correct title?: y
blackkklansman BlacKkKlansman 2018
Correct title?: y
bridge of spies Bridge of Spies 2015
Correct title?: y
bumblebee Bumblebee 2018
Correct title?: y
can you ever forgive me Can You Ever Forgive Me? 2018
Correct title?: y
captain america: civil war Captain America: Civil War 2016
Correct title?: y
captain phillips Captain Phillips 2013
Correct title?: y
crazy rich asians Crazy Rich Asians 2018
Correct title?: y
dallas buyers club Dallas Buyers Club 2013
Correct title?: y
dawn of the planet of the apes Dawn of the Planet of the Apes 2014
Correct title?: y
the death of stalin The Death of Stalin 2017
Correct title?: y
the disaster artist The Disaster Artist 2017
Correct title?: y
doctor strange Doctor Strange 2016
Correct title?: y
dolemite is my name Dolemite Is My Name 2019
Correct title?: y
don t think twice Don't Think Twice 2016
Correct title?: y
eye in the sky Eye in the Sky 2015
Correct title?: y
a fantastic woman A Fantastic Woman 2017


girlhood Girlhood 2014
Correct title?: y
white material White Material 2009
Correct title?: y
suspiria Suspiria 2018
Correct title?: y
foxcatcher Foxcatcher 2014
Correct title?: y
killing them softly Killing Them Softly 2012
Correct title?: y
enemy Enemy 2013
Correct title?: y
the assassin The Assassin 2015
Correct title?: y
jackie Jackie 2016
Correct title?: y


old man and the gun, love and friendship

lemonade Limonata 2015
The Tribe

In [16]:
imdb_dict

{'12 years a slave': ('2024544', 2013),
 '1917': ('8579674', 2019),
 '20 feet from stardom': ('2396566', 2013),
 '20th century women': ('4385888', 2016),
 '45 years': ('3544082', 2015),
 'a beautiful day in the neighborhood': ('3224458', 2019),
 'a bread factory part one part two': ('6884380', 2018),
 'a dark song': ('4805316', 2016),
 'a fantastic woman': ('5639354', 2017),
 'a film unfinished': ('1568923', 2010),
 'a ghost story': ('6265828', 2017),
 'a girl walks home alone at night': ('2326554', 2014),
 'a prophet': ('1235166', 2009),
 'a quiet passion': ('2392830', 2016),
 'a quiet place': ('6644200', 2018),
 'a screaming man': ('1639901', 2010),
 'a separation': ('1832382', 2011),
 'a star is born': ('1517451', 2018),
 'a touch of sin': ('2852400', 2013),
 'actress': ('3212392', 2014),
 'ad astra': ('2935510', 2019),
 'all is lost': ('2017038', 2013),
 'amazing grace': ('4935462', 2018),
 'american honey': ('3721936', 2016),
 'american hustle': ('1800241', 2013),
 'amour': ('1602

In [18]:
#one row per title: [my title, IMDb ID, year]
imdb_list = [[movie, imdb_id, year] for movie, (imdb_id, year) in imdb_dict.items()]
#DataFrame.append was removed in pandas 2.0, so build the frame directly
imdb_df = pd.DataFrame(imdb_list, columns = ['My_Title', 'IMDB_ID', 'Year'])
imdb_df.head()

Unnamed: 0,My_Title,IMDB_ID,Year
0,melancholia,1527186,2011
1,mad max: fury road,1392190,2015
2,the tree of life,478304,2011
3,the rider,6217608,2017
4,a separation,1832382,2011


In [20]:
imdb_df.to_csv('imdb.csv')
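A side note on the CSV round trip: writing with the default index and reading the file back is what produces the extra `Unnamed: 0` columns, and numeric parsing drops leading zeros from the IDs. The sketch below shows one way around both, assuming the IDs should stay as zero-padded strings. It uses an in-memory buffer instead of `imdb.csv`, and the single row is only illustrative.

```python
import io
import pandas as pd

# One illustrative row; the ID is kept as a zero-padded string.
df = pd.DataFrame({
    "My_Title": ["the tree of life"],
    "IMDB_ID": ["0478304"],
    "Year": [2011],
})

# index=False keeps the row index out of the file, so no "Unnamed: 0" on reload.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# dtype=str stops pandas from parsing the ID as an integer and dropping the zero.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"IMDB_ID": str})
print(restored)
```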

This was a nice exercise in fact-checking all my data. I found one inconsistency as well:
stan ollie -> stan & ollie. Now with the IMDb IDs I can pull out a wealth of information! The features I selected were:

In [22]:
#movie = ia.get_movie(ID)
#movie.infoset2keys
#not all features always found in infoset2keys!

In [23]:
features = ['title',
            'cast', 
            'genres',
            'runtimes',
            'box office',
            'rating',
            'votes',
            'kind',
            'directors',
            'writers',
            'producers',
            'composers',
            'cinematographers',
            'editors',
            'casting directors',
            'top 250 rank']

The way the API works is that you send it the IMDb ID and it returns a movie object along with the set of keys it has available. Not all movies have all the features, so I need to write a check for whether each feature exists in that set of keys before reading it.
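The availability check described above could be wrapped in a small helper along these lines. This is a sketch only: `collect_features` and `FakeMovie` are hypothetical names introduced here, and `FakeMovie` is an offline stand-in that mimics just the two things the helper relies on (key lookup and an `infoset2keys` mapping).

```python
def collect_features(movie, features, infoset="main"):
    """Return {feature: value}, with None for features the movie lacks."""
    available = set(movie.infoset2keys.get(infoset, []))
    return {f: (movie[f] if f in available else None) for f in features}


class FakeMovie(dict):
    """Hypothetical stand-in for a movie object, for offline illustration."""

    @property
    def infoset2keys(self):
        # Report every stored key as belonging to the 'main' infoset.
        return {"main": list(self.keys())}


example = FakeMovie(title="example title", kind="movie")
row = collect_features(example, ["title", "kind", "votes"])
print(row)
```

Returning `None` for missing features keeps every row the same shape, which makes it easier to load the results into a DataFrame later.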

In [24]:
for ID in imdb_df.IMDB_ID[55:60]:
    movie = ia.get_movie(ID)
    for feature in features:
        if feature in movie.infoset2keys['main']:
            print(feature)
            print(movie[feature])
            print()
        else:
            print(feature + " not available\n")

title
Stories We Tell

cast
[<Person id:0689574[http] name:_Michael Polley_>, <Person id:0191409[http] name:_Harry Gulkin_>, <Person id:5305911[http] name:_Susy Buchan_>, <Person id:0117956[http] name:_John Buchan_>, <Person id:0689573[http] name:_Mark Polley_>, <Person id:0689572[http] name:_Joanna Polley_>, <Person id:0347757[http] name:_Cathy Gulkin_>, <Person id:5305979[http] name:_Marie Murphy_>, <Person id:5306026[http] name:_Robert MacMillan_>, <Person id:0846876[http] name:_Anne Tait_>, <Person id:0100831[http] name:_Deirdre Bowen_>, <Person id:0593799[http] name:_Victoria Mitchell_>, <Person id:0710372[http] name:_Mort Ransen_>, <Person id:0101133[http] name:_Geoffrey Bowes_>, <Person id:0125148[http] name:_Tom Butler_>, <Person id:0081754[http] name:_Pixie Bigelow_>, <Person id:5306023[http] name:_Claire Walker_>, <Person id:0420953[http] name:_Rebecca Jenkins_>, <Person id:5305897[http] name:_Peter Evans_>, <Person id:1835511[http] name:_Alex Hatz_>, <Person id:4563047[http]

title
Moneyball

cast
[<Person id:0000093[http] name:_Brad Pitt_>, <Person id:1706767[http] name:_Jonah Hill_>, <Person id:0000450[http] name:_Philip Seymour Hoffman_>, <Person id:0000705[http] name:_Robin Wright_>, <Person id:0695435[http] name:_Chris Pratt_>, <Person id:1212071[http] name:_Stephen Bishop_>, <Person id:0224703[http] name:_Reed Diamond_>, <Person id:0421116[http] name:_Brent Jennings_>, <Person id:0575850[http] name:_Ken Medlock_>, <Person id:0087109[http] name:_Tammy Blanchard_>, <Person id:0569079[http] name:_Jack McGee_>, <Person id:0749490[http] name:_Vyto Ruginis_>, <Person id:0780678[http] name:_Nick Searcy_>, <Person id:0607703[http] name:_Glenn Morshower_>, <Person id:4000166[http] name:_Casey Bond_>, <Person id:4021123[http] name:_Nick Porrazzo_>, <Person id:1748388[http] name:_Kerris Dorsey_>, <Person id:0397124[http] name:_Arliss Howard_>, <Person id:3003906[http] name:_Reed Thompson_>, <Person id:1255793[http] name:_James Shanklin_>, <Person id:0067053[http

title
Black Panther

cast
[<Person id:1569276[http] name:_Chadwick Boseman_>, <Person id:0430107[http] name:_Michael B. Jordan_>, <Person id:2143282[http] name:_Lupita Nyong'o_>, <Person id:1775091[http] name:_Danai Gurira_>, <Person id:0293509[http] name:_Martin Freeman_>, <Person id:2257207[http] name:_Daniel Kaluuya_>, <Person id:4004793[http] name:_Letitia Wright_>, <Person id:6328300[http] name:_Winston Duke_>, <Person id:1250791[http] name:_Sterling K. Brown_>, <Person id:0000291[http] name:_Angela Bassett_>, <Person id:0001845[http] name:_Forest Whitaker_>, <Person id:0785227[http] name:_Andy Serkis_>, <Person id:0441042[http] name:_Florence Kasumba_>, <Person id:0434712[http] name:_John Kani_>, <Person id:1605085[http] name:_David S. Lee_>, <Person id:8852246[http] name:_Nabiyah Be_>, <Person id:0207218[http] name:_Isaach De Bankolé_>, <Person id:0158448[http] name:_Connie Chiume_>, <Person id:7262074[http] name:_Dorothy Steel_>, <Person id:0764527[http] name:_Danny Sapani_>, <

Directors, actors, etc. are lists containing Person objects. Here's how you unpack the object.

In [25]:
movie['cast'][0].personID, movie['cast'][0]['name']

('1569276', 'Chadwick Boseman')
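The same unpacking can be applied to a whole list at once. A minimal sketch, assuming each entry exposes `personID` and `['name']` as in the cell above; `FakePerson` is a hypothetical stand-in used so the example runs offline, and `people_to_pairs` is a name introduced here.

```python
class FakePerson:
    """Hypothetical stand-in for a Person object, for offline illustration."""

    def __init__(self, person_id, name):
        self.personID = person_id
        self._data = {"name": name}

    def __getitem__(self, key):
        return self._data[key]


def people_to_pairs(people):
    """Turn a list of Person-like objects into (personID, name) tuples."""
    return [(p.personID, p["name"]) for p in people]


cast = [
    FakePerson("1569276", "Chadwick Boseman"),
    FakePerson("0430107", "Michael B. Jordan"),
]
pairs = people_to_pairs(cast)
print(pairs)
```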

The box office value is not consistent so I'm not going to make use of it. I can try scraping this value later from Box Office Mojo.

https://github.com/situkun123/Moive_mojo_project/blob/master/Moive_mojo_project.ipynb

For now, I'm focusing on adding the features I have.

In [59]:
import numpy as np
for feature in features:
    imdb_df[feature] = ""
imdb_df

Unnamed: 0.1,Unnamed: 0,My_Title,IMDB_ID,Year,title,cast,genres,runtimes,box office,rating,votes,kind,directors,writers,producers,composers,cinematographers,editors,casting directors,top 250 rank
0,0,melancholia,1527186,2011,,,,,,,,,,,,,,,,
1,1,mad max: fury road,1392190,2015,,,,,,,,,,,,,,,,
2,2,the tree of life,478304,2011,,,,,,,,,,,,,,,,
3,3,the rider,6217608,2017,,,,,,,,,,,,,,,,
4,4,a separation,1832382,2011,,,,,,,,,,,,,,,,
5,5,moonlight,4975722,2016,,,,,,,,,,,,,,,,
6,6,the fits,4238858,2015,,,,,,,,,,,,,,,,
7,7,margaret,466893,2011,,,,,,,,,,,,,,,,
8,8,spider-man: into the spider-verse,4633694,2018,,,,,,,,,,,,,,,,
9,9,the florida project,5649144,2017,,,,,,,,,,,,,,,,


In [85]:
for i in range(len(imdb_df)):
    print(str(i) + "/" + str(len(imdb_df)))
    ID = imdb_df.loc[i].IMDB_ID
    movie = ia.get_movie(ID)
    for feature in features:
        if feature in movie.infoset2keys['main']:
            print(feature)
            if type(movie[feature]) == list:
                print (movie[feature])
            else:
                imdb_df.loc[i, feature] = movie[feature]

0/427
title
cast
[<Person id:0000379[http] name:_Kirsten Dunst_>, <Person id:0001250[http] name:_Charlotte Gainsbourg_>, <Person id:0002907[http] name:_Alexander Skarsgård_>, <Person id:1227232[http] name:_Brady Corbet_>, <Person id:3999508[http] name:_Cameron Spurr_>, <Person id:0001648[http] name:_Charlotte Rampling_>, <Person id:0159802[http] name:_Jesper Christensen_>, <Person id:0000457[http] name:_John Hurt_>, <Person id:0001745[http] name:_Stellan Skarsgård_>, <Person id:0001424[http] name:_Udo Kier_>, <Person id:0000662[http] name:_Kiefer Sutherland_>, <Person id:0128555[http] name:_James Cagnard_>, <Person id:3364187[http] name:_Deborah Fronko_>, <Person id:2024499[http] name:_Charlotta Miller_>, <Person id:5631259[http] name:_Claire Miller_>, <Person id:0924264[http] name:_Gary Whitaker_>, <Person id:1575436[http] name:_Katrine Sahlstrøm_>, <Person id:1936295[http] name:_Christian Geisnæs_>, <Person id:4512373[http] name:_Stefan Cronwall_>, <Person id:8344777[http] name:_Pete

ValueError: Must have equal len keys and value when setting with an iterable

In [82]:
movie[feature][0]
movie[feature][0].personID
movie[feature][0].Name

AttributeError: 'Person' object has no attribute 'Name'
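Both errors point at the same fix: the name lives under `['name']` rather than a `.Name` attribute, and list or dict values need to be reduced to a single value before they go into one cell. The sketch below shows one possible approach, not the notebook's final code: `to_cell` and `FakePerson` are hypothetical names, the `record` values are illustrative, and joining with `"; "` is just one choice of flattening.

```python
import pandas as pd


class FakePerson:
    """Hypothetical stand-in for a Person object, for offline illustration."""

    def __init__(self, person_id, name):
        self.personID = person_id
        self._data = {"name": name}

    def __getitem__(self, key):
        return self._data[key]


def to_cell(value):
    """Reduce a list or dict value to one string; pass scalars through."""
    if isinstance(value, list):
        # Person-like entries are read via ['name']; other entries are stringified.
        return "; ".join(v["name"] if hasattr(v, "personID") else str(v) for v in value)
    if isinstance(value, dict):
        return "; ".join(f"{k}: {v}" for k, v in value.items())
    return value


# Illustrative record shaped like the values seen in the outputs above.
record = {
    "title": "example title",
    "cast": [FakePerson("1569276", "Chadwick Boseman"),
             FakePerson("0430107", "Michael B. Jordan")],
    "genres": ["Drama", "Sci-Fi"],
    "box office": {"Budget": "n/a"},
}

frame = pd.DataFrame({"title": [""], "cast": [""], "genres": [""], "box office": [""]})
for feature, value in record.items():
    # .at sets exactly one cell, so the value is never treated as a row of values.
    frame.at[0, feature] = to_cell(value)
print(frame)
```

Storing joined strings loses the person IDs; if those matter, a separate table of (movie ID, person ID, role) rows may be a better fit.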