# Analyzing IMDB's Top 250 Movies

This is an analysis of the Internet Movie Database's Top 250 movies, scraped on June 7th, 2016. I'm using this notebook as a scratchpad for finding interesting trends in the data, which I'll then transfer over to a webpage.

In [2]:
import numpy as np

from pymongo import MongoClient
from imdb_data_parser import IMDBDataParser
from imdb_scraper import IMDBScraper

### Download IMDB Movie Data

Download the movie data from IMDB.

In [44]:
#imdb_scraper = IMDBScraper()
#imdb_scraper.download_data()

### MongoDB

Load the MongoDB client and collection. If the `movies` collection doesn't already exist, parse the downloaded movie data and insert it into the collection. Otherwise, leave the existing data alone.

In [3]:
client = MongoClient()
db = client.imdb
collection = db.movies

# collection_names() was removed in PyMongo 4; list_collection_names() replaces it
if 'movies' not in db.list_collection_names():
    imdb_parser = IMDBDataParser()
    movie_data = imdb_parser.parse_files('pages')
    collection.insert_many(movie_data)

### Analysis

Analyze the data using MongoDB's aggregation pipeline.

_Which movie title has the fewest characters? Which movie title has the most?_

In [18]:
movie_titles = collection.find()
fewest = movie_titles[0]['title']
most = movie_titles[0]['title']
for r in movie_titles:
    t = r['title']
    if len(t) < len(fewest):
        fewest = t
    if len(t) > len(most):
        most = t
print "Fewest: {}".format(fewest)
print "Most: {}".format(most)

Fewest: M
Most: Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
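
The same comparison can also be written with Python's built-in `min` and `max` using `key=len`. This is only a minimal sketch: `sample_titles` is a made-up stand-in for the titles returned by `collection.find()`, and `shortest_and_longest` is a helper name I'm introducing here. Note that `len` counts every character, including spaces and punctuation, not just letters.

```python
# Hypothetical sample data; in the notebook the titles come from MongoDB.
sample_titles = [
    "Jaws",
    "M",
    "The Shining",
    "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb",
]


def shortest_and_longest(titles):
    """Return the (shortest, longest) title by character count."""
    # On ties, min() and max() both keep the first match, like the loop above.
    return min(titles, key=len), max(titles, key=len)


fewest, most = shortest_and_longest(sample_titles)
print("Fewest: {}".format(fewest))
print("Most: {}".format(most))
```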


_What are the most and least popular genres?_

In [7]:
unwind = {'$unwind': '$genres'}
group = {'$group' : {'_id': '$genres', 'count': {'$sum': 1}}}
sort = {'$sort': {'count': -1}}
movie_genres = collection.aggregate([unwind, group, sort])
for a in movie_genres:
    print "{}: {}".format(a['_id'], a['count'])

Drama: 171
Adventure: 64
Crime: 55
Comedy: 39
Action: 37
Thriller: 34
Mystery: 28
Biography: 25
Sci-Fi: 24
Romance: 21
Fantasy: 21
Animation: 19
War: 18
History: 14
Family: 13
Horror: 9
Film-Noir: 8
Western: 7
Sport: 4
Musical: 2
Music: 1
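
To make it clearer what the `$unwind`, `$group`, and `$sort` stages are doing, here is a rough pure-Python equivalent. It is a sketch under assumptions: `sample_movies` is invented data shaped like the notebook's documents, and `count_genres` is a hypothetical helper, not part of the scraper or parser.

```python
from collections import Counter

# Hypothetical documents; only the 'genres' field matters here.
sample_movies = [
    {"title": "Movie A", "genres": ["Drama", "Crime"]},
    {"title": "Movie B", "genres": ["Drama"]},
    {"title": "Movie C", "genres": ["Crime", "Drama", "Thriller"]},
]


def count_genres(movies):
    """Return (genre, count) pairs, most common genre first."""
    # Flattening every genre list mirrors $unwind; Counter mirrors
    # $group with {'$sum': 1}.
    counts = Counter(genre for movie in movies for genre in movie["genres"])
    # most_common() orders by count descending, like {'$sort': {'count': -1}}.
    return counts.most_common()


genre_counts = count_genres(sample_movies)
for genre, count in genre_counts:
    print("{}: {}".format(genre, count))
```

Because a movie can list several genres, each movie is counted once per genre. That is why the totals above add up to far more than 250.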


_Which directors have the most films in the Top 250 and what are they?_

In [5]:
group = {'$group': {'_id': '$director', 'count': {'$sum': 1}, 'titles': {'$push': '$title'}}}
sort = {'$sort': {'count': -1}}
limit = {'$limit': 20}
movie_directors = collection.aggregate([group, sort, limit])
for d in movie_directors:
    print "Director: {}".format(d['_id'])
    print "Number of movies: {}".format(d['count'])
    print ', '.join(d['titles'])
    print '\n'

Director: Stanley Kubrick
Number of movies: 7
2001: A Space Odyssey, A Clockwork Orange, Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, Barry Lyndon, The Shining, Full Metal Jacket, Paths of Glory


Director: Martin Scorsese
Number of movies: 7
Shutter Island, Raging Bull, The Wolf of Wall Street, Goodfellas, Taxi Driver, The Departed, Casino


Director: Christopher Nolan
Number of movies: 7
The Prestige, Memento, Batman Begins, The Dark Knight, Interstellar, Inception, The Dark Knight Rises


Director: Steven Spielberg
Number of movies: 7
Catch Me If You Can, Jurassic Park, Schindler's List, Indiana Jones and the Last Crusade, Raiders of the Lost Ark, Jaws, Saving Private Ryan


Director: Alfred Hitchcock
Number of movies: 7
Rear Window, Rebecca, Dial M for Murder, Psycho, North by Northwest, Strangers on a Train, Vertigo


Director: Akira Kurosawa
Number of movies: 6
Throne of Blood, Yojimbo, Seven Samurai, Ran, Rashômon, Ikiru


Director: Hayao Miyazaki
Number

_Which writers have the most films in the Top 250 and what are they?_

In [4]:
unwind = {'$unwind': '$writer'}
group = {'$group': {'_id': '$writer', 'count': {'$sum': 1}, 'titles': {'$push': '$title'}}}
sort = {'$sort': {'count': -1}}
limit = {'$limit': 20}
movie_writers = collection.aggregate([unwind, group, sort, limit])
for d in movie_writers:
    print "Writer: {}".format(d['_id'].encode('utf-8'))
    print "Number of movies: {}".format(d['count'])
    print ', '.join(d['titles'])
    print '\n'

Writer: Stanley Kubrick
Number of movies: 7
2001: A Space Odyssey, A Clockwork Orange, Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, Barry Lyndon, The Shining, Full Metal Jacket, Paths of Glory


Writer: Hayao Miyazaki
Number of movies: 7
Howl's Moving Castle, Castle in the Sky, Spirited Away, My Neighbor Totoro, Princess Mononoke, Nausicaä of the Valley of the Wind, Nausicaä of the Valley of the Wind


Writer: Akira Kurosawa
Number of movies: 6
Yojimbo, Yojimbo, Seven Samurai, Ran, Rashômon, Ikiru


Writer: Quentin Tarantino
Number of movies: 6
Inglourious Basterds, Pulp Fiction, Django Unchained, Reservoir Dogs, Kill Bill: Vol. 1, Kill Bill: Vol. 1


Writer: Christopher Nolan
Number of movies: 6
The Prestige, Memento, The Dark Knight, Interstellar, Inception, The Dark Knight Rises


Writer: Billy Wilder
Number of movies: 5
The Apartment, Some Like It Hot, Witness for the Prosecution, Sunset Blvd., Double Indemnity


Writer: Pete Docter
Number of movies: 5
Mons

_Which actors/actresses have appeared in the most films in the Top 250? What movies have they appeared in?_

In [6]:
unwind = {'$unwind': '$cast'}
group = {'$group': {'_id': '$cast.actor', 'count': {'$sum': 1}, 'titles': {'$push': '$title'}}}
sort = {'$sort': {'count': -1}}
limit = {'$limit': 20}

movie_actors = collection.aggregate([unwind, group, sort, limit])
for a in movie_actors:
    print "Actor: {}".format(a['_id'].encode('utf-8'))
    print "Number of movies: {}".format(a['count'])
    print ', '.join(a['titles'])
    print '\n'

Actor: Robert De Niro
Number of movies: 8
Once Upon a Time in America, Raging Bull, Goodfellas, The Deer Hunter, Heat, The Godfather: Part II, Taxi Driver, Casino


Actor: Harrison Ford
Number of movies: 8
Star Wars: Episode V - The Empire Strikes Back, Star Wars: Episode IV - A New Hope, Apocalypse Now, Indiana Jones and the Last Crusade, Star Wars: The Force Awakens, Star Wars: Episode VI - Return of the Jedi, Raiders of the Lost Ark, Blade Runner


Actor: Morgan Freeman
Number of movies: 7
Million Dollar Baby, Batman Begins, The Shawshank Redemption, Se7en, The Dark Knight, Unforgiven, The Dark Knight Rises


Actor: Leonardo DiCaprio
Number of movies: 7
Catch Me If You Can, Shutter Island, Django Unchained, The Wolf of Wall Street, Inception, The Revenant, The Departed


Actor: Alec Guinness
Number of movies: 6
Star Wars: Episode V - The Empire Strikes Back, Star Wars: Episode IV - A New Hope, Kind Hearts and Coronets, Star Wars: Episode VI - Return of the Jedi, Lawrence of Arabia, 

_In which countries was each of the Top 250 movies originally released?_

In [5]:
unwind = {'$unwind': '$country'}
group = {'$group': {'_id': '$country', 'count': {'$sum': 1}}}
sort = {'$sort': {'count': -1}}
movie_countries = collection.aggregate([unwind, group, sort])
for c in movie_countries:
    print "{}: {}".format(c['_id'], c['count'])

USA: 182
UK: 46
France: 18
Germany: 15
Italy: 14
Japan: 14
Spain: 7
Hong Kong: 6
India: 5
Sweden: 5
West Germany: 5
Canada: 5
New Zealand: 3
Australia: 3
Mexico: 3
Ireland: 3
South Korea: 2
China: 2
Iran: 2
Argentina: 2
Austria: 2
Soviet Union: 1
Taiwan: 1
Morocco: 1
South Africa: 1
Libya: 1
Lebanon: 1
Denmark: 1
Switzerland: 1
Brazil: 1
Algeria: 1
Kuwait: 1
United Arab Emirates: 1
Poland: 1


_Of the countries with more than 10 movies, what is the average rating?_

In [13]:
unwind = {'$unwind': '$country'}
group = {'$group': {'_id': '$country', 'count': {'$sum': 1}, 'avg_rating': {'$avg': '$rating'}}}
match = {'$match': {'count': {'$gt': 10}}}
sort = {'$sort': {'count': -1}}
movie_countries = collection.aggregate([unwind, group, match, sort])
for c in movie_countries:
    print "{}: {}".format(c['_id'], c['avg_rating'])

USA: 8.32967032967
UK: 8.27826086957
France: 8.32222222222
Germany: 8.34666666667
Italy: 8.33571428571
Japan: 8.32142857143


_What are the most popular character first names in the Top 250 movies?_

In [109]:
# Only $unwind runs in MongoDB; the first names are extracted in Python below.
unwind = {'$unwind': '$cast'}

movie_characters = collection.aggregate([unwind])
d = {}
bad = ['the', 'additional', '', 'man', 'police', 'a', 'un', '(voice)']
title = ['captain', 'young', 'general', 'detective', 'private', 
        'juror', 'sheriff', 'miss', 'old', 'uncle', 'judge', 'officer',
        'colonel', 'big', 'chief', 'sergeant', 'professor', 'inspector', 'lieutenant',
        'agent', 'princess', 'aunt', 'father', 'major', 'nurse', 'lord']
#common_names = {'mike': 'michael', 'tom': 'thomas', 'joe': 'joseph', 'bob': 'robert'}
for r in movie_characters:
    name_split = r['cast']['character'].split(' ')
    name = name_split[0].lower()
    if name in title and len(name_split) != 1:
        name = name_split[1].lower()
    if (name in bad) or name.endswith('.'):
        continue
    #if name in common_names.keys():
        #name = common_names[name]
    if name in d:
        d[name] += 1
    else:
        d[name] = 1
print sorted(d.items(), key=lambda x:x[1])[-20:]

[(u'harry', 7), (u'bill', 7), (u'michael', 7), (u'billy', 7), (u'ben', 7), (u'carl', 8), (u'robert', 8), (u'luke', 8), (u'walter', 8), (u'paul', 8), (u'joe', 9), (u'james', 10), (u'george', 10), (u'mary', 10), (u'jim', 11), (u'jack', 12), (u'tom', 12), (u'sam', 14), (u'frank', 15), (u'john', 16)]
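
The name-cleaning rules in the loop above are easier to check when pulled out into a small function. This is a sketch, not the notebook's code: `first_name` is a helper I'm introducing, the `BAD` and `TITLES` sets are shortened copies of the lists above, and the character strings are made up.

```python
from collections import Counter

# Shortened copies of the notebook's lists, for illustration only.
BAD = {"the", "additional", "", "man", "police", "a", "un", "(voice)"}
TITLES = {"captain", "young", "general", "detective", "uncle", "miss"}


def first_name(character):
    """Return the lower-cased first name, or None if the credit is skipped."""
    parts = character.split(" ")
    name = parts[0].lower()
    # Skip a leading title such as "Captain" when another word follows it.
    if name in TITLES and len(parts) != 1:
        name = parts[1].lower()
    # Drop filler words and abbreviations ending in a period ("Dr.", "Mr.").
    if name in BAD or name.endswith("."):
        return None
    return name


# Hypothetical character credits.
sample_characters = [
    "John Smith",
    "Captain John Doe",
    "Dr. Jones",
    "The Stranger",
    "Uncle Charlie",
]

names = [first_name(c) for c in sample_characters]
name_counts = Counter(n for n in names if n is not None)
print(name_counts.most_common())
```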


_What are the shortest and longest movies? What is the average run time of all the movies?_

In [5]:
proj = { '$project': {'_id': 0, 'title': 1, 'duration': 1}}
for s in [-1, 1]:
    d = collection.aggregate([{'$sort': {'duration': s}}, proj, {'$limit': 1}])
    for r in d:
        print '{}: {}'.format(r['title'], r['duration'])
group = {'$group': { '_id': 'Average Movie Duration', 'dur': {'$avg': '$duration'}}}
avg_movie_duration = collection.aggregate([group])
for r in avg_movie_duration:
    print '{}: {} minutes'.format(r['_id'], r['dur'])

Gangs of Wasseypur: 320
The General: 67
Average Movie Duration: 128.964 minutes


_What were the least and most expensive movies to produce?_

In [3]:
movie_titles = collection.find({}, {'_id': 0, 'title': 1, 'box_office.budget': 1})
cheapest = [movie_titles[0]['title'], movie_titles[0]['box_office']['budget']]
most_expensive = [movie_titles[0]['title'], movie_titles[0]['box_office']['budget']]
for r in movie_titles:
    b = (r['box_office']['budget'])
    t = r['title']
    if b is None:
        continue
    if b < cheapest[1]:
        cheapest = [t,b]
    if b > most_expensive[1]:
        most_expensive = [t,b]
print '{}: {}'.format(*cheapest)
print '{}: {}'.format(*most_expensive)

Bicycle Thieves: 133000
Captain America: Civil War: 250000000
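
One thing to watch with the loop above: it seeds `cheapest` and `most_expensive` from the first document, which could go wrong if that document has no budget (on Python 3, comparing a number with `None` raises a `TypeError`). A minimal sketch of an alternative that filters out missing budgets first, using invented `sample_movies` data and a hypothetical `budget_extremes` helper:

```python
# Hypothetical documents; a budget of None means IMDB listed no figure.
sample_movies = [
    {"title": "Movie A", "box_office": {"budget": None}},
    {"title": "Movie B", "box_office": {"budget": 5000000}},
    {"title": "Movie C", "box_office": {"budget": 120000}},
]


def budget_extremes(movies):
    """Return ((title, budget) of the cheapest, (title, budget) of the priciest)."""
    # Keep only movies with a known budget so None is never compared.
    known = [
        (m["title"], m["box_office"]["budget"])
        for m in movies
        if m["box_office"]["budget"] is not None
    ]
    cheapest = min(known, key=lambda pair: pair[1])
    most_expensive = max(known, key=lambda pair: pair[1])
    return cheapest, most_expensive


cheapest, most_expensive = budget_extremes(sample_movies)
print("{}: {}".format(*cheapest))
print("{}: {}".format(*most_expensive))
```

The same pattern would work for the gross figures by swapping the field name.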


_What were the highest and lowest grossing films?_

In [4]:
movie_titles = collection.find({}, {'_id': 0, 'title': 1, 'box_office.gross': 1})
lowest = [movie_titles[0]['title'], movie_titles[0]['box_office']['gross']]
highest = [movie_titles[0]['title'], movie_titles[0]['box_office']['gross']]
for r in movie_titles:
    b = (r['box_office']['gross'])
    t = r['title']
    if b is None:
        continue
    if b < lowest[1]:
        lowest = [t,b]
    if b > highest[1]:
        highest = [t,b]
print '{}: {}'.format(*lowest)
print '{}: {}'.format(*highest)

All About Eve: 10177
Star Wars: The Force Awakens: 936627416


Unfortunately, these numbers don't really tell us much. Some of the values haven't been updated in a long time, and none of them are adjusted for inflation.

_How many movies were produced in each decade?_ Decades start on a year ending in 1 and end on a year ending in 0, so the 1950s include movies from 1951-1960 and the 1960s include movies from 1961-1970.

In [36]:
movie_titles = collection.find({}, {'_id': 0, 'title': 1, 'date_published': 1})
d = {}
for r in movie_titles:
    v = r['date_published'][:4]
    if v[3] == '0':
        v = str(int(v) - 1)[:3] + '0'
    else:
        v = v[:3] + '0'
    if v in d:
        d[v] += 1
    else:
        d[v] = 1
for (y, c) in sorted(d.items(), key=lambda x:x[0]):
    print '{}: {}'.format(y, c)

1920: 5
1930: 9
1940: 12
1950: 28
1960: 17
1970: 26
1980: 26
1990: 42
2000: 53
2010: 32


The increase in top-rated movies over the decades (come on, 1960s, you ruined my chart!) makes sense for several reasons:
* The number of movies released each decade has increased steadily over the years.
* The younger generation is much more prone to watching newer films than older films.
* The younger generation is also much more likely to vote on a website because they are more technologically savvy.

So that brings me to another question: what happened in the 1960s?

A rough search on IMDB gives us about 20,000 titles for the 1950s and almost 28,000 for the 1960s, so the decline can't be attributed to fewer films being produced.

* While television had been broadcasting since the 1920s, its popularity started to increase in the late 1940s. By 1950, there were 6 million televisions in the United States. By 1960, there were over 50 million! As television grew more popular, movie attendance decreased drastically, and film companies began producing more content for television. The decrease in movie attendance coincided with financial difficulties for the movie industry as a whole.

* With the movie industry on the decline, many companies, including Warner Bros, United Artists, and Paramount, were bought out by business conglomerates that were more interested in making money quickly than in producing quality films.

* Many of the famous directors from the 1950s and earlier were on the decline as well, including John Ford, Howard Hawks, George Stevens, Alfred Hitchcock (although we did get The Birds and Topaz in the 1960s), William Wyler, and others. This brought on a new breed of directors with whom moviegoers weren't familiar, which perhaps affected overall theater attendance.

## Resources

* [Film History of the 1960s](http://www.filmsite.org/60sintro.html)

* [Number of Televisions in the US](http://hypertextbook.com/facts/2007/TamaraTamazashvili.shtml)