https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

The dataset was pulled from Kaggle (an online data science community and dataset repository). The user who uploaded the data collected it from The Movie Database (TMDb), as the dataset URL and file name indicate. Each record stores the movie title, budget, revenue, genre list, average rating, and a popularity score (calculated from movie page views), among other fields. The data is interesting because it covers a large number of movies and reports on financial performance, ratings, and popularity. A movie producer could use it to show potential investors how a new movie idea compares with existing successful movies, and to estimate basic trends in financial success, ratings, and popularity.

Questions to explore: Is there a correlation between budget and revenue? Is there a relationship between gross profits and genre type? Is there a relationship between spoken languages (dubs) and gross profit?
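
As a rough illustration of how the first question could be measured, the sketch below computes a Pearson correlation coefficient with pandas. It uses only the budget and revenue figures of the five films that appear in the `moviedata.head()` output further down, so the number it prints says nothing about the full dataset; the `sample` frame is an assumption made for the example. On the full data the equivalent call would be `moviedata['budget'].corr(moviedata['revenue'])`.

```python
import pandas as pd

# Budget and revenue (in dollars) of the five films shown by moviedata.head().
# This tiny sample is for illustration only, not a substitute for the full data.
sample = pd.DataFrame({
    'title': ['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
              'The Dark Knight Rises', 'John Carter'],
    'budget': [237000000, 300000000, 245000000, 250000000, 260000000],
    'revenue': [2787965087, 961000000, 880674609, 1084939099, 284139100],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = sample['budget'].corr(sample['revenue'])
print("Pearson r (5-film sample): {:.3f}".format(r))
```

A value near 1 or -1 would suggest a strong linear relationship and a value near 0 a weak one, though five films are far too few to draw a conclusion.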

In [7]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import calendar
from matplotlib.pyplot import subplots, show
import ast
%matplotlib inline

moviedata = pd.read_csv('tmdb_5000_movies.csv', encoding = "ISO-8859-1")

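def find_set(series):
    # Collect the distinct 'name' values found in a column of list-of-dict strings.
    # The set is created inside the function: a mutable default (unique=[]) is
    # shared between calls and would carry names from one column into the next.
    unique = set()
    for fulllist in series:
        for fulldict in ast.literal_eval(fulllist):
            if 'name' in fulldict:
                unique.add(fulldict['name'])
    return unique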

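def name_fix(value):
    # Turn a list-of-dict string into a comma-separated string of names.
    # Anything that cannot be parsed that way is returned unchanged, so a date
    # such as '2009-12-10' is never replaced by an evaluated number.
    try:
        parsed = ast.literal_eval(value)
        return ', '.join(d['name'] for d in parsed if 'name' in d)
    except (SyntaxError, ValueError, TypeError):
        return value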
    
un_genre = find_set(moviedata.genres)
un_country = find_set(moviedata.production_countries)
un_language = find_set(moviedata.spoken_languages)
un_keywords = find_set(moviedata.keywords)
print("COUNTRIES: {}, GENRES: {}, LANGUAGES: {}".format(len(un_country),len(un_genre),len(un_language)))

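# Only these columns hold list-of-dict strings; the other columns are left untouched.
list_columns = ['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages']
for column in list_columns:
    moviedata[column] = moviedata[column].apply(name_fix)
# 'gross' is revenue minus budget
moviedata['gross'] = moviedata['revenue'] - moviedata['budget']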

print("BUDGET MAX: ${:,.2f} REVENUE MAX: ${:,.2f} GROSSING MAX: ${:,.2f}".format(moviedata['budget'].max(),moviedata['revenue'].max(),moviedata['gross'].max()))    
print("BUDGET MEAN: ${:,.2f} REVENUE MEAN: ${:,.2f} GROSSING MEAN: ${:,.2f}".format(moviedata['budget'].mean(),moviedata['revenue'].mean(),moviedata['gross'].mean()))    
       
moviedata.head()

COUNTRIES: 108, GENRES: 20, LANGUAGES: 170
BUDGET MAX: $380,000,000.00 REVENUE MAX: $2,787,965,087.00 GROSSING MAX: $2,550,965,087.00
BUDGET MEAN: $29,045,039.88 REVENUE MEAN: $82,260,638.65 GROSSING MEAN: $53,215,598.78


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,gross
0,237000000,"Action, Adventure, Fantasy, Science Fiction",http://www.avatarmovie.com/,19995,"culture clash, future, space war, space colony...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"Ingenious Film Partners, Twentieth Century Fox...",...,2009-12-10,2787965087,162.0,"English, Español",Released,Enter the World of Pandora.,Avatar,7.2,11800,2550965087
1,300000000,"Adventure, Fantasy, Action",http://disney.go.com/disneypictures/pirates/,285,"ocean, drug abuse, exotic island, east india t...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"Walt Disney Pictures, Jerry Bruckheimer Films,...",...,2007-05-19,961000000,169.0,English,Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,661000000
2,245000000,"Action, Adventure, Crime",http://www.sonypictures.com/movies/spectre/,206647,"spy, based on novel, secret agent, sequel, mi6...",en,Spectre,A cryptic message from Bond’s past sends him...,107.376788,"Columbia Pictures, Danjaq, B24",...,2015-10-26,880674609,148.0,"Français, English, Español, Italiano, Deutsch",Released,A Plan No One Escapes,Spectre,6.3,4466,635674609
3,250000000,"Action, Crime, Drama, Thriller",http://www.thedarkknightrises.com/,49026,"dc comics, crime fighter, terrorist, secret id...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"Legendary Pictures, Warner Bros., DC Entertain...",...,2012-07-16,1084939099,165.0,English,Released,The Legend Ends,The Dark Knight Rises,7.6,9106,834939099
4,260000000,"Action, Adventure, Science Fiction",http://movies.disney.com/john-carter,49529,"based on novel, mars, medallion, space travel,...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,Walt Disney Pictures,...,2012-03-07,284139100,132.0,English,Released,"Lost in our world, found in another.",John Carter,6.1,2124,24139100


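There are ~4800 movies and 20 genre types. The printed country and language counts (108 and 170) overstate the distinct values: in the run shown, each `find_set` call also included the names collected by the calls before it, so the 108 includes the genre names and the 170 includes both the genre and country names.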
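
Counts like these depend on parsing each cell and collecting the distinct names per column. The sketch below shows one way to do that so that the counts for separate columns stay independent: the set is created inside the function on every call. The helper name `unique_names` and the sample cells are hypothetical; the cells only imitate the shape of the raw columns and are not rows from the dataset.

```python
import ast

def unique_names(cells):
    """Collect the distinct 'name' values from list-of-dict strings."""
    names = set()  # created on every call, so nothing leaks between columns
    for cell in cells:
        for entry in ast.literal_eval(cell):
            if 'name' in entry:
                names.add(entry['name'])
    return names

# Made-up cells in the same shape as the raw columns (illustration only)
sample_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}]',
]
sample_languages = [
    '[{"iso_639_1": "en", "name": "English"}]',
    '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Español"}]',
]

genre_names = unique_names(sample_genres)
language_names = unique_names(sample_languages)
print("GENRES: {}, LANGUAGES: {}".format(len(genre_names), len(language_names)))
```

Because each call starts from an empty set, the language count reflects only the language cells, whichever columns were processed earlier.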

The largest budget is $380,000,000 and the highest revenue is $2,787,965,087. The highest gross (revenue minus budget) is $2,550,965,087, for Avatar.

The average budget is $29,045,039.88, the average revenue is $82,260,638.65, and the average gross is $53,215,598.78.
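
To make the gross figure concrete, the sketch below recomputes it as revenue minus budget for the five films shown in the `moviedata.head()` output above. The `head` frame is rebuilt by hand from those printed values, as an assumption for the example, rather than read from the CSV file.

```python
import pandas as pd

# Values copied from the five rows printed by moviedata.head()
head = pd.DataFrame({
    'title': ['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
              'The Dark Knight Rises', 'John Carter'],
    'budget': [237000000, 300000000, 245000000, 250000000, 260000000],
    'revenue': [2787965087, 961000000, 880674609, 1084939099, 284139100],
})

# Same definition as in the notebook: gross = revenue - budget
head['gross'] = head['revenue'] - head['budget']

# idxmax returns the row label of the largest gross value
top_title = head.loc[head['gross'].idxmax(), 'title']
print(head[['title', 'gross']])
print("Largest gross in this sample:", top_title)
```

The recomputed values match the `gross` column in the printed table, and Avatar comes out on top of this five-film sample.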

In [8]:
def genre_check(value):
    # Return True if any known genre name appears in the given genre string.
    return any(member in value for member in un_genre)

for column in un_genre:
    moviedata[column] = moviedata.genres.apply(lambda value: column in value)
    
del moviedata['genres']

intcolumns = ['budget','popularity','revenue','runtime','vote_average','gross']

In [9]:
dfSeries = {}
genre_list = list(un_genre)  # one fixed order, reused for the rows and the index

for col in intcolumns:
    # Mean of this column over the movies tagged with each genre
    dfSeries[col] = [moviedata.loc[moviedata[genre], col].mean() for genre in genre_list]

dfGenre = pd.DataFrame(dfSeries)
dfGenre.index = genre_list

dfGenre

Unnamed: 0,budget,gross,popularity,revenue,runtime,vote_average
Family,50719510.0,111626000.0,27.832849,162345500.0,97.298246,6.02963
Music,15907950.0,32548000.0,13.101512,48455950.0,109.924324,6.355676
Mystery,30744490.0,47556440.0,24.586827,78300930.0,109.591954,6.183908
Drama,20678320.0,31437910.0,17.764853,52116230.0,113.314895,6.388594
Foreign,658088.4,-293436.9,0.686787,364651.5,110.617647,6.352941
Fantasy,63560610.0,129793600.0,36.387043,193354200.0,107.278302,6.096698
Animation,66465900.0,159227100.0,38.813439,225693000.0,89.923077,6.341453
War,35282460.0,48873420.0,23.777289,84155870.0,131.833333,6.713889
Science Fiction,51865550.0,100591000.0,36.451806,152456500.0,107.478505,6.005607
Western,27078700.0,19167260.0,18.236279,46245960.0,117.353659,6.178049
