# IMDB Top 250 Data Analysis

This is an analysis of the Internet Movie Database's Top 250 movies, scraped on June 7th, 2016. I'm using this notebook as a scratchpad to find interesting trends in the data, which I'll then transfer to a webpage.

In [1]:
import numpy as np

from pymongo import MongoClient
from imdb_data_parser import IMDBDataParser
from imdb_scraper import IMDBScraper

### Download IMDB Movie Data

Download the movie data from IMDB.

In [44]:
#imdb_scraper = IMDBScraper()
#imdb_scraper.download_data()

### MongoDB

Load the MongoDB client and collection. If the `movies` collection doesn't exist yet, parse the downloaded movie data and insert it into the collection. Otherwise, leave the existing data untouched.

In [45]:
client = MongoClient()
db = client.imdb
collection = db.movies

# Note: collection_names() was removed in PyMongo 4; use list_collection_names() there.
if 'movies' not in db.collection_names():
    imdb_parser = IMDBDataParser()
    movie_data = imdb_parser.parse_files('pages')
    collection.insert_many(movie_data)
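
The load-once behavior described above can also be keyed on whether the collection holds any documents, rather than on the collection's name. The following is a minimal sketch, assuming PyMongo 3.7 or later (where `count_documents` is available). `load_if_empty` and `parse_fn` are hypothetical names introduced for illustration; they are not part of the notebook's modules.

```python
def load_if_empty(collection, parse_fn):
    """Insert parsed documents only when the collection holds none.

    `collection` is expected to behave like a pymongo Collection.
    `parse_fn` is a hypothetical zero-argument callable returning a list of dicts.
    Returns the number of documents inserted.
    """
    # count_documents({}) counts every document in the collection.
    if collection.count_documents({}) > 0:
        return 0
    documents = parse_fn()
    if not documents:
        # insert_many rejects an empty list, so skip the call entirely.
        return 0
    collection.insert_many(documents)
    return len(documents)
```

With the names used in this notebook, a call might look like `load_if_empty(collection, lambda: IMDBDataParser().parse_files('pages'))`.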

### Analysis

Analyze the data using MongoDB's Aggregation Pipeline.

_Which movie title has the fewest characters? Which movie title has the most?_

In [4]:
movie_titles = collection.find({}, {'_id': 0, 'title': 1})
fewest = movie_titles[0]['title']
most = movie_titles[0]['title']
for r in movie_titles:
    t = r['title']
    if len(t) < len(fewest):
        fewest = t
    if len(t) > len(most):
        most = t
print "Fewest: {}".format(fewest)
print "Most: {}".format(most)

Fewest: M
Most: Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
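
The same comparison can be written with Python's built-in `min` and `max` using `key=len`. This is a small sketch on a hand-picked list of titles that appear elsewhere in this notebook, not on the full collection.

```python
# A small sample of titles; the real list comes from collection.find().
titles = [
    "Jaws",
    "M",
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
]

# key=len compares titles by character count (spaces and punctuation included).
fewest = min(titles, key=len)
most = max(titles, key=len)
```

On ties, `min` and `max` both return the first matching title, which matches the strict `<` and `>` comparisons in the loop above.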


_What are the most and least popular genres?_

In [5]:
unwind = {'$unwind': '$genres'}
group = {'$group' : {'_id': '$genres', 'count': {'$sum': 1}}}
sort = {'$sort': {'count': -1}}
movie_genres = collection.aggregate([unwind, group, sort])
for a in movie_genres:
    print "{}: {}".format(a['_id'], a['count'])

Drama: 171
Adventure: 64
Crime: 55
Comedy: 39
Action: 37
Thriller: 34
Mystery: 28
Biography: 25
Sci-Fi: 24
Romance: 21
Fantasy: 21
Animation: 19
War: 18
History: 14
Family: 13
Horror: 9
Film-Noir: 8
Western: 7
Sport: 4
Musical: 2
Music: 1
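
To make the `$unwind`, `$group`, `$sort` sequence concrete, here is a rough pure-Python equivalent of what each stage does. The sample documents and their genre lists are hypothetical and only illustrate the shape of the data.

```python
from collections import Counter

# Hypothetical sample documents; the real ones live in the `movies` collection.
movies = [
    {"title": "Jaws", "genres": ["Adventure", "Drama", "Thriller"]},
    {"title": "Psycho", "genres": ["Horror", "Mystery", "Thriller"]},
    {"title": "Casino", "genres": ["Crime", "Drama"]},
]

# $unwind: emit one entry per element of each movie's `genres` array.
unwound = [genre for movie in movies for genre in movie["genres"]]

# $group with {'$sum': 1}: count how often each genre occurs.
counts = Counter(unwound)

# $sort by count, descending. Ties are broken by name here to keep the order stable.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Because a movie can list several genres, the counts add up to more than the number of movies, which is why the totals above exceed 250.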


_Which directors have the most films in the Top 250 and what are they?_

In [6]:
group = {'$group': {'_id': '$director', 'count': {'$sum': 1}, 'titles': {'$push': '$title'}}}
sort = {'$sort': {'count': -1}}
limit = {'$limit': 10}
movie_directors = collection.aggregate([group, sort, limit])
for d in movie_directors:
    print "Director: {}".format(d['_id'])
    print "Number of movies: {}".format(d['count'])
    print ', '.join(d['titles'])
    print '\n'

Director: Alfred Hitchcock
Number of movies: 7
Rear Window, Rebecca, Dial M for Murder, Psycho, North by Northwest, Strangers on a Train, Vertigo


Director: Steven Spielberg
Number of movies: 7
Catch Me If You Can, Jurassic Park, Schindler's List, Indiana Jones and the Last Crusade, Raiders of the Lost Ark, Jaws, Saving Private Ryan


Director: Christopher Nolan
Number of movies: 7
The Prestige, Memento, Batman Begins, The Dark Knight, Interstellar, Inception, The Dark Knight Rises


Director: Stanley Kubrick
Number of movies: 7
2001: A Space Odyssey, A Clockwork Orange, Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, Barry Lyndon, The Shining, Full Metal Jacket, Paths of Glory


Director: Martin Scorsese
Number of movies: 7
Shutter Island, Raging Bull, The Wolf of Wall Street, Goodfellas, Taxi Driver, The Departed, Casino


Director: Hayao Miyazaki
Number of movies: 6
Howl's Moving Castle, Castle in the Sky, Spirited Away, My Neighbor Totoro, Princess Mononoke, 

_Which actors/actresses have appeared in the most films in the Top 250? What movies have they appeared in?_

In [7]:
unwind = {'$unwind': '$cast'}
group = {'$group': {'_id': '$cast.actor', 'count': {'$sum': 1}, 'titles': {'$push': '$title'}}}
sort = {'$sort': {'count': -1}}
limit = {'$limit': 50}

movie_actors = collection.aggregate([unwind, group, sort, limit])
for a in movie_actors:
    print "Actor: {}".format(a['_id'].encode('utf-8'))
    print "Number of movies: {}".format(a['count'])
    print ', '.join(a['titles'])
    print '\n'

Actor: Harrison Ford
Number of movies: 8
Star Wars: Episode V - The Empire Strikes Back, Star Wars: Episode IV - A New Hope, Apocalypse Now, Indiana Jones and the Last Crusade, Star Wars: The Force Awakens, Star Wars: Episode VI - Return of the Jedi, Raiders of the Lost Ark, Blade Runner


Actor: Robert De Niro
Number of movies: 8
Once Upon a Time in America, Raging Bull, Goodfellas, The Deer Hunter, Heat, The Godfather: Part II, Taxi Driver, Casino


Actor: Leonardo DiCaprio
Number of movies: 7
Catch Me If You Can, Shutter Island, Django Unchained, The Wolf of Wall Street, Inception, The Revenant, The Departed


Actor: Morgan Freeman
Number of movies: 7
Million Dollar Baby, Batman Begins, The Shawshank Redemption, Se7en, The Dark Knight, Unforgiven, The Dark Knight Rises


Actor: Michael Caine
Number of movies: 6
The Prestige, Batman Begins, The Dark Knight, Interstellar, Inception, The Dark Knight Rises


Actor: Alec Guinness
Number of movies: 6
Star Wars: Episode V - The Empire Stri

_In which countries was each of the Top 250 movies originally released?_

In [8]:
unwind = {'$unwind': '$country'}
group = {'$group': {'_id': '$country', 'count': {'$sum': 1}}}
sort = {'$sort': {'count': -1}}
movie_countries = collection.aggregate([unwind, group, sort])
for c in movie_countries:
    print "{}: {}".format(c['_id'], c['count'])

USA: 182
UK: 46
France: 18
Germany: 15
Italy: 14
Japan: 14
Spain: 7
Hong Kong: 6
India: 5
Sweden: 5
West Germany: 5
Canada: 5
New Zealand: 3
Australia: 3
Mexico: 3
Ireland: 3
South Korea: 2
China: 2
Iran: 2
Argentina: 2
Austria: 2
Soviet Union: 1
Taiwan: 1
Morocco: 1
South Africa: 1
Libya: 1
Lebanon: 1
Denmark: 1
Switzerland: 1
Brazil: 1
Algeria: 1
Kuwait: 1
United Arab Emirates: 1
Poland: 1


_What are the most popular character first names in the Top 250 movies?_

In [9]:
unwind = {'$unwind': '$cast'}
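# Note: .split(" ")[0] runs in Python on the literal string '$cast.character' and
# returns it unchanged, so this groups by the full character name, not the first name.
group = {'$group': {'_id': '$cast.character'.split(" ")[0], 'count': {'$sum': 1}}}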
sort = {'$sort': {'count': -1}}
limit = {'$limit': 50}

movie_characters = collection.aggregate([unwind, group, sort, limit])
for a in movie_characters:
    print "{}: {}".format(a['_id'].encode('utf-8'), (a['count']))

: 15
Additional Voices (voice): 14
Un enfant /  Child: 7
Sam: 5
Silver Assay Worker: 4
Han Solo: 4
C-3PO: 4
Captain: 4
Chewbacca: 4
Luke Skywalker: 4
Frank: 4
Judge: 4
Joe: 3
Legolas: 3
Galadriel: 3
Pippin: 3
Ernie: 3
(voice): 3
Cafe Patron: 3
Nancy: 3
Darth Vader: 3
Princess Leia: 3
Luke: 3
Union General: 3
Doctor: 3
Alfred: 3
Jimmy: 3
Mother (voice): 3
Howard: 2
Theoden: 2
Everard Proudfoot: 2
Boromir: 2
Aunt Emma: 2
Celeborn: 2
Kanta (voice): 2
Lawrence: 2
Agnes: 2
Connie: 2
Bridesmaid: 2
Bruce Wayne: 2
A Tramp (as Charlie Chaplin): 2
Guido: 2
Punk: 2
TV Commentator: 2
Brad: 2
Sallah: 2
Woody (voice): 2
Spats' Henchman: 2
Rex (voice): 2
Coroner: 2


_What are the shortest and longest movies? What is the average run time of all the movies?_

In [46]:
proj = { '$project': {'_id': 0, 'title': 1, 'duration': 1}}
for s in [-1, 1]:
    d = collection.aggregate([{'$sort': {'duration': s}}, proj, {'$limit': 1}])
    for r in d:
        print '{}: {}'.format(r['title'], r['duration'])
group = {'$group': { '_id': 'Average Movie Duration', 'dur': {'$avg': '$duration'}}}
avg_movie_duration = collection.aggregate([group])
for r in avg_movie_duration:
    print '{}: {} minutes'.format(r['_id'], r['dur'])

Gangs of Wasseypur: 320
The General: 67
Average Movie Duration: 128.964 minutes
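
For reference, the three values computed by the pipelines above correspond to the following plain-Python operations. Only the first two durations come from the output above; the third entry is hypothetical and is there just to make the average non-trivial.

```python
# Illustrative sample; durations are in minutes.
movies = [
    {"title": "Gangs of Wasseypur", "duration": 320},
    {"title": "The General", "duration": 67},
    {"title": "Example Film", "duration": 120},  # hypothetical entry
]

def duration(movie):
    """Return the movie's run time in minutes."""
    return movie["duration"]

# Equivalent of sorting by duration (descending or ascending) with $limit 1.
longest = max(movies, key=duration)
shortest = min(movies, key=duration)

# Equivalent of $group with $avg over the whole collection.
average = sum(duration(m) for m in movies) / float(len(movies))
```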
