Research Question Ideas:

How does the popularity and reception of a movie relate to its budget/revenue?

In [263]:
''' 
https://www.kaggle.com/rounakbanik/the-movies-dataset
 
Loading datasets
'''


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [264]:
# Ran into EOF inside string error, set engine to python to avoid it

mov_metadata = pd.read_csv("movies_metadata.csv", engine='python')

In [265]:
mov_metadata.head()

'''
Need to filter out adult films (done)

Drop homepage/overview/poster_path/tagline/belongs_to_collection/
production_companies/spoken_languages/original_language/original_title/video
(done)

Explore genres, id/imdb_id, popularity, production_countries, 
budget/revenue (done)

Drop unreleased movies (status column) (done)

Rearrange columns for readability

On budget: Some extremely relevant films have been made on shoestring budgets
Eraserhead by David Lynch was done with $10,000 (forget filtering budget?)

Box office data from TMDB gets sparser for movies 15+ years old
"Movie grosses reporting isn't an exact science. Studios and distributors have 
started disclosing detailed figures only recently; the information for films 
released over 15 years ago are therefore very sketchy. The longer you go back 
in time, the less reliable the information becomes." - IMDB

Ideas for data exploration:
Budget vs Revenue exploration and inferences
Genre vs Budget/Revenue
Original or Spoken Language/Production Countries (analysis of data by location/culture)
Release Date (analyze data over time periods)
What are the most common runtimes?  How does runtime relate to budget?
Popularity
Vote Average/Vote Count (relate to budget, genre, release date, etc?)
'''


In [266]:
mov_metadata.isnull().sum()

'''
A lot of these columns will be dropped anyways

Dropping unreleased and adult movies fixes a lot of our null issues, let's see
what things look like after
'''

In [267]:
# Dataframe shape

mov_metadata.shape

(45466, 24)

In [268]:
# Dropping irrelevant columns

mov_metadata = mov_metadata.drop([
                                  'homepage', 
                                  'overview', 
                                  'poster_path', 
                                  'tagline',
                                  'belongs_to_collection',
                                  'production_companies',
                                  'spoken_languages',
                                  'original_language',
                                  'original_title',
                                  'video',
                                  ], 
                                 axis=1)

In [269]:
# Looking at values for 'adult' and 'status' columns

print(mov_metadata['adult'].value_counts())
print(mov_metadata['status'].value_counts())

# We can get rid of all the movies that aren't released and adult movies

False                                                                                                                             45454
True                                                                                                                                  9
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: adult, dtype: int64
Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64


In [270]:
# Removing adult films and unreleased films
# ('adult' was read in as strings, so we compare against the string 'False')

mov_metadata = mov_metadata[mov_metadata['adult'] == 'False']
mov_metadata = mov_metadata[mov_metadata['status'] == 'Released']

In [271]:
mov_metadata = mov_metadata.drop(['adult', 'status'], axis=1)
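
The filter above compares against the string 'False' because a few malformed rows leave the adult column as text rather than booleans. Here is a minimal sketch of the same idea on a made-up frame (the `toy` rows are hypothetical, not taken from the dataset), showing that garbage strings and missing values both fail the comparison and get dropped:

```python
import pandas as pd

# Hypothetical toy rows mimicking the string-typed 'adult' and 'status' columns
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'adult': ['False', 'True', 'False', ' - Written by X'],
    'status': ['Released', 'Released', 'Rumored', None],
})

# Keep only rows that are explicitly non-adult and released;
# anything else (garbage text, missing status) fails the comparison
keep = (toy['adult'] == 'False') & (toy['status'] == 'Released')
filtered = toy[keep].drop(columns=['adult', 'status'])
print(filtered)
```

Only row A survives here, since it is the only one that passes both checks.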

In [272]:
mov_metadata.isnull().sum()

'''
15 missing IMDB IDs
78 missing release dates
251 missing runtimes

let's get a closer look at these observations
'''

In [273]:
# A view of all the observations that are missing IMDB IDs

mov_metadata[mov_metadata['imdb_id'].isnull()]

Unnamed: 0,budget,genres,id,imdb_id,popularity,production_countries,release_date,revenue,runtime,title,vote_average,vote_count
8966,1000000,"[{'id': 80, 'name': 'Crime'}]",36337,,0.156722,"[{'iso_3166_1': 'US', 'name': 'United States o...",1991-06-07,0.0,100.0,Delusion,4.8,3.0
13757,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 53, 'nam...",200796,,0.149818,[],2003-03-19,0.0,,Show,6.3,2.0
13821,0,"[{'id': 10769, 'name': 'Foreign'}, {'id': 28, ...",75015,,0.202468,"[{'iso_3166_1': 'PL', 'name': 'Poland'}]",1970-04-06,0.0,73.0,How I Unleashed World War II Part III: Among F...,7.0,3.0
17382,2500000,"[{'id': 9648, 'name': 'Mystery'}, {'id': 53, '...",36663,,0.035294,[],,0.0,110.0,Dreamkiller,5.0,1.0
18959,0,"[{'id': 99, 'name': 'Documentary'}, {'id': 16,...",28500,,1.556352,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",2005-11-05,0.0,87.0,Before The Dinosaurs - Walking With Monsters,7.3,12.0
19322,0,[],118013,,1.233673,[],,0.0,98.0,Endeavour,6.6,19.0
20806,0,"[{'id': 16, 'name': 'Animation'}, {'id': 28, '...",15257,,5.539197,"[{'iso_3166_1': 'US', 'name': 'United States o...",2009-01-27,0.0,38.0,Hulk vs. Wolverine,6.8,48.0
20937,0,"[{'id': 37, 'name': 'Western'}, {'id': 28, 'na...",55576,,1.087671,"[{'iso_3166_1': 'US', 'name': 'United States o...",1997-01-19,0.0,0.0,Last Stand at Saber River,3.7,3.0
21916,0,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",293412,,0.465559,[],1995-01-01,0.0,0.0,Running Wild,10.0,1.0
23744,0,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",30146,,1.116784,"[{'iso_3166_1': 'JP', 'name': 'Japan'}]",2006-10-01,0.0,180.0,Gunbuster vs Diebuster Aim for the Top! The GA...,6.5,5.0


In [274]:
# I do some googling to find if a movie has an IMDB ID or not and fill in the
# data appropriately

mov_metadata.loc[8966, 'imdb_id']='tt0101704'
mov_metadata.loc[13757, 'imdb_id']='tt0358146'
mov_metadata.loc[18959, 'imdb_id']='tt0490048'
mov_metadata.loc[20937, 'imdb_id']='tt0119501'
mov_metadata.loc[21916, 'imdb_id']='tt0105298'
mov_metadata.loc[33753, 'imdb_id']='tt6618142'
mov_metadata.loc[36955, 'imdb_id']='tt1147516'
mov_metadata.loc[41832, 'imdb_id']='tt4699464'
mov_metadata.loc[45070, 'imdb_id']='tt6051554'

# These entries are smaller parts of bigger movies, not featured on IMDB and
# resulting in some missing data

mov_metadata = mov_metadata.drop(13821)
mov_metadata = mov_metadata.drop(20806)

# Couldn't find any info on these entries

mov_metadata = mov_metadata.drop(17382)
mov_metadata = mov_metadata.drop(19322)
mov_metadata = mov_metadata.drop(23744)
mov_metadata = mov_metadata.drop(40809)
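
The manual fixes above could also be kept in two small lookup structures so they're easier to audit later. This is only a sketch on a made-up frame: `found_ids`, `not_found`, and the IDs in them are hypothetical placeholders, not the real values used above.

```python
import pandas as pd

toy = pd.DataFrame(
    {'title': ['A', 'B', 'C', 'D'],
     'imdb_id': [None, 'tt0000002', None, None]},
    index=[10, 11, 12, 13],
)

# Hypothetical hand-made lookups: row label -> IMDB ID found by searching
found_ids = {10: 'tt0000001'}
# Hypothetical row labels where no usable IMDB entry turned up
not_found = [12, 13]

for label, imdb_id in found_ids.items():
    # Only assign to labels that still exist; .loc would otherwise add a new row
    if label in toy.index:
        toy.loc[label, 'imdb_id'] = imdb_id

# errors='ignore' skips labels that an earlier filter already removed
toy = toy.drop(index=not_found, errors='ignore')
print(toy)
```

The membership check matters because assigning through `.loc` to a label that isn't in the index silently appends a mostly-empty row instead of raising an error.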

In [275]:
mov_metadata[mov_metadata['budget']=='0']

Unnamed: 0,budget,genres,id,imdb_id,popularity,production_countries,release_date,revenue,runtime,title,vote_average,vote_count
2,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,11.7129,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,Grumpier Old Men,6.5,92.0
4,0,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,8.387519,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,Father of the Bride Part II,5.7,173.0
7,0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",45325,tt0112302,2.561161,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,97.0,Tom and Huck,5.4,45.0
11,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",12110,tt0112896,5.430331,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1995-12-22,0.0,88.0,Dracula: Dead and Loving It,5.7,210.0
12,0,"[{'id': 10751, 'name': 'Family'}, {'id': 16, '...",21032,tt0112453,12.140733,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,11348324.0,78.0,Balto,7.1,423.0
...,...,...,...,...,...,...,...,...,...,...,...,...
45461,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",439050,tt6209470,0.072051,"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,Subdue,4.0,1.0
45462,0,"[{'id': 18, 'name': 'Drama'}]",111109,tt2028550,0.178241,"[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,Century of Birthing,9.0,3.0
45463,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",67758,tt0303758,0.903007,"[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,Betrayal,3.8,6.0
45464,0,[],227506,tt0008536,0.003503,"[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,Satan Triumphant,0.0,0.0


In [276]:
# After doing some research it looks like a budget of 0 is simply unfound data,
# we should change these to NaNs to exclude them from analysis if necessary

mov_metadata.loc[mov_metadata['budget'] == '0','budget'] = np.nan
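
To get a feel for how much data survives this step, here is a small sketch on made-up values (the `toy` frame and `share_known` name are just for illustration). It masks the '0' placeholder, converts the remaining strings to numbers, and reports the share of rows with a known budget:

```python
import pandas as pd

# Hypothetical budgets stored as strings, with '0' standing in for "unknown"
toy = pd.DataFrame({'budget': ['30000000', '0', '65000000', '0']})

# mask() replaces matching entries with NaN; to_numeric then parses the rest
toy['budget'] = pd.to_numeric(toy['budget'].mask(toy['budget'] == '0'))

# Fraction of rows that still have a usable budget
share_known = toy['budget'].notna().mean()
print(toy['budget'].tolist(), share_known)
```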

In [277]:
# It looks like we only have good budget data for ~9000 movie entries or 20% of
# our dataset - but the budget data we DO have should still be relevant

mov_metadata['budget'].value_counts()

5000000     286
10000000    259
20000000    243
2000000     241
15000000    226
           ... 
33            1
4185000       1
2090000       1
280           1
9272437       1
Name: budget, Length: 1218, dtype: int64

In [278]:
mov_metadata[mov_metadata['revenue']==0]

Unnamed: 0,budget,genres,id,imdb_id,popularity,production_countries,release_date,revenue,runtime,title,vote_average,vote_count
2,,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,11.7129,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,Grumpier Old Men,6.5,92.0
6,58000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",11860,tt0114319,6.677277,"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",1995-12-15,0.0,127.0,Sabrina,6.2,141.0
7,,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",45325,tt0112302,2.561161,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,97.0,Tom and Huck,5.4,45.0
11,,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",12110,tt0112896,5.430331,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1995-12-22,0.0,88.0,Dracula: Dead and Loving It,5.7,210.0
21,,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",1710,tt0112722,10.701801,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-27,0.0,124.0,Copycat,6.5,199.0
...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",439050,tt6209470,0.072051,"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,Subdue,4.0,1.0
45462,,"[{'id': 18, 'name': 'Drama'}]",111109,tt2028550,0.178241,"[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,Century of Birthing,9.0,3.0
45463,,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",67758,tt0303758,0.903007,"[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,Betrayal,3.8,6.0
45464,,[],227506,tt0008536,0.003503,"[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,Satan Triumphant,0.0,0.0


In [279]:
mov_metadata.loc[mov_metadata['revenue'] == 0,'revenue'] = np.nan

In [280]:
mov_metadata['runtime'].value_counts()

90.0      2530
0.0       1495
100.0     1457
95.0      1400
93.0      1202
          ... 
238.0        1
316.0        1
258.0        1
780.0        1
1256.0       1
Name: runtime, Length: 353, dtype: int64

In [281]:
mov_metadata[mov_metadata['runtime']==0]

Unnamed: 0,budget,genres,id,imdb_id,popularity,production_countries,release_date,revenue,runtime,title,vote_average,vote_count
222,,"[{'id': 53, 'name': 'Thriller'}]",61813,tt0112899,0.155859,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-01-01,,0.0,Dream Man,2.5,1.0
224,,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",62488,tt0112854,0.710671,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-04-28,,0.0,Destiny Turns on the Radio,5.3,9.0
398,,[],172923,tt0112889,0.233376,[],1995-05-26,,0.0,Dos Crímenes,5.0,1.0
554,,[],218473,tt0109226,0.38247,[],1994-01-01,,0.0,"The Beans of Egypt, Maine",0.0,1.0
667,,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",221917,tt0114307,0.117662,"[{'iso_3166_1': 'IE', 'name': 'Ireland'}, {'is...",1995-09-22,,0.0,The Run of the Country,8.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
45370,,"[{'id': 18, 'name': 'Drama'}]",374764,tt3529010,0.274793,"[{'iso_3166_1': 'AR', 'name': 'Argentina'}]",2015-11-12,,0.0,How Most Things Work,6.8,2.0
45371,,"[{'id': 16, 'name': 'Animation'}]",460135,tt7158814,8.413734,"[{'iso_3166_1': 'US', 'name': 'United States o...",2017-08-30,,0.0,LEGO DC Super Hero Girls: Brain Drain,10.0,2.0
45399,750000,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",280422,tt3805180,0.201582,"[{'iso_3166_1': 'RU', 'name': 'Russia'}]",2014-06-05,3.0,0.0,All at Once,6.0,4.0
45416,,"[{'id': 35, 'name': 'Comedy'}]",282308,tt0427767,0.003732,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1912-01-01,,0.0,"Whiffles, Cubic Artist",0.0,0.0


In [282]:
'''
Around 1500 movies missing their runtimes, not much I can do to fill in the
data with that much missing so we'll simply convert them to NaNs
These will get bundled together with the other observations that were already
NaNs and we can filter these out as needed
'''

mov_metadata.loc[mov_metadata['runtime'] == 0, 'runtime'] = np.nan

In [283]:
# We probably want to turn these empty lists into NaNs

mov_metadata['genres'].value_counts()

[{'id': 18, 'name': 'Drama'}]                                                                                                                                      4941
[{'id': 35, 'name': 'Comedy'}]                                                                                                                                     3585
[{'id': 99, 'name': 'Documentary'}]                                                                                                                                2677
[]                                                                                                                                                                 2383
[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]                                                                                                    1292
                                                                                                                                                                

In [284]:
# Replace the empty genre lists with NaNs

mov_metadata.loc[mov_metadata['genres'] == '[]', 'genres'] = np.nan

# We're left with 4044 observations with genre data, usable but a small subset
# of all our actual data
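
The genres column holds stringified lists of dicts, so comparing genres later will need the names pulled out. One possible way to do that, sketched on made-up cells (the `genre_names` helper is hypothetical and not part of the notebook):

```python
import ast
import pandas as pd

# Hypothetical cells in the same format as the 'genres' column
toy = pd.Series([
    "[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    None,
])

def genre_names(cell):
    """Return the list of genre names, or an empty list for missing cells."""
    if not isinstance(cell, str):
        return []
    # literal_eval safely parses the Python-literal text into a list of dicts
    return [genre['name'] for genre in ast.literal_eval(cell)]

names = toy.apply(genre_names)
# explode() gives one row per genre so each name can be counted separately
counts = names.explode().value_counts()
print(counts)
```

Counting per genre this way avoids treating every distinct genre combination as its own category.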

In [285]:
# We have a number of 0 values here, but they're actually relevant data in this
# context and don't necessarily need to be converted to NaNs

mov_metadata['popularity'].value_counts()

0.0         64
1e-06       56
0.000308    42
0.00022     39
0.000578    37
            ..
0.372215     1
0.541651     1
0.932111     1
0.19332      1
5.745206     1
Name: popularity, Length: 43347, dtype: int64

In [286]:
# Converting this column from string to float 

mov_metadata['popularity'] = np.array(mov_metadata['popularity'], dtype=np.float32)

In [287]:
mov_metadata['popularity'].max()

547.48828125

In [288]:
mov_metadata['production_countries'].value_counts()

[{'iso_3166_1': 'US', 'name': 'United States of America'}]                                                                                                            17730
[]                                                                                                                                                                     6146
[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]                                                                                                                       2228
[{'iso_3166_1': 'FR', 'name': 'France'}]                                                                                                                               1638
[{'iso_3166_1': 'JP', 'name': 'Japan'}]                                                                                                                                1349
                                                                                                                                            

In [289]:
mov_metadata.head()

Unnamed: 0,budget,genres,id,imdb_id,popularity,production_countries,release_date,revenue,runtime,title,vote_average,vote_count
0,30000000.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,21.946943,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,Toy Story,7.7,5415.0
1,65000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,17.015539,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,Jumanji,6.9,2413.0
2,,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,11.7129,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,,101.0,Grumpier Old Men,6.5,92.0
3,16000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,3.859495,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,Waiting to Exhale,6.1,34.0
4,,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,8.387519,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,Father of the Bride Part II,5.7,173.0


In [290]:
# Rearranging column labels for readability

mov_metadata = mov_metadata[[
                             'id', 
                             'imdb_id', 
                             'title', 
                             'release_date', 
                             'runtime', 
                             'production_countries', 
                             'genres', 'budget', 
                             'revenue', 
                             'popularity',
                             'vote_average',
                             'vote_count'
                             ]]

In [291]:
mov_metadata.head()

Unnamed: 0,id,imdb_id,title,release_date,runtime,production_countries,genres,budget,revenue,popularity,vote_average,vote_count
0,862,tt0114709,Toy Story,1995-10-30,81.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",30000000.0,373554033.0,21.946943,7.7,5415.0
1,8844,tt0113497,Jumanji,1995-12-15,104.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",65000000.0,262797249.0,17.015539,6.9,2413.0
2,15602,tt0113228,Grumpier Old Men,1995-12-22,101.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,,11.7129,6.5,92.0
3,31357,tt0114885,Waiting to Exhale,1995-12-22,127.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",16000000.0,81452156.0,3.859495,6.1,34.0
4,11862,tt0113041,Father of the Bride Part II,1995-02-10,106.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'id': 35, 'name': 'Comedy'}]",,76578911.0,8.387519,5.7,173.0


In [292]:
# Converting budget, revenue, and vote count to float so we can perform 
# operations using them

mov_metadata['budget'] = np.array(mov_metadata['budget'], dtype=np.float32)
mov_metadata['revenue'] = np.array(mov_metadata['revenue'], dtype=np.float32)
mov_metadata['vote_count'] = np.array(mov_metadata['vote_count'], dtype=np.float32)
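
One thing to keep in mind with float32: it only carries about 7 significant digits, so very large dollar amounts get rounded to the nearest representable value. A quick sketch with an illustrative number (the value below is assumed for the example, not read from the data):

```python
import numpy as np

# Illustrative large dollar amount (an assumption for this example)
value = 2_787_965_087

# float32 values near 2.8 billion are spaced 256 apart, so the value shifts
as_f32 = float(np.float32(value))
# float64 represents integers of this size exactly
as_f64 = float(np.float64(value))

print(as_f32, as_f64)
```

If exact dollar figures matter, float64 would be the safer choice; for the broad comparisons here the rounding is small relative to the values.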

In [293]:
# Engineering some useful features

mov_metadata['profit'] = mov_metadata['revenue'] - mov_metadata['budget']
mov_metadata['budget per minute of runtime'] = mov_metadata['budget'] / mov_metadata['runtime']
mov_metadata['budget per user vote'] = mov_metadata['budget'] / mov_metadata['vote_count'][mov_metadata['vote_count']!=0]
mov_metadata['revenue per minute of runtime'] = mov_metadata['revenue'] / mov_metadata['runtime']
mov_metadata['revenue per user vote'] = mov_metadata['revenue'] / mov_metadata['vote_count'][mov_metadata['vote_count']!=0]
mov_metadata['profit per minute of runtime'] = mov_metadata['profit'] / mov_metadata['runtime']
mov_metadata['profit per user vote'] = mov_metadata['profit'] / mov_metadata['vote_count'][mov_metadata['vote_count']!=0]
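
The per-vote features above divide by a filtered Series, which works because pandas aligns on the index: rows with zero votes are missing from the divisor and come out as NaN instead of infinity. A small sketch on made-up numbers, next to a more explicit version of the same thing:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the middle row has zero votes
toy = pd.DataFrame({'budget': [100.0, 50.0, 80.0],
                    'vote_count': [4.0, 0.0, 10.0]})

# Option 1 (as in the cell above): divide by the filtered Series;
# index alignment leaves NaN where the divisor row was filtered out
per_vote_aligned = toy['budget'] / toy['vote_count'][toy['vote_count'] != 0]

# Option 2: turn zeros into NaN explicitly before dividing
per_vote_explicit = toy['budget'] / toy['vote_count'].replace(0, np.nan)

print(per_vote_aligned.tolist())
print(per_vote_explicit.tolist())
```

Both options give the same result; the second just makes the intent easier to see.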

In [294]:
# Rounding our engineered features to 2 decimal places (kept as float32)

engineered_cols = [
    'budget per minute of runtime',
    'budget per user vote',
    'revenue per minute of runtime',
    'revenue per user vote',
    'profit per minute of runtime',
    'profit per user vote',
]
mov_metadata[engineered_cols] = mov_metadata[engineered_cols].round(2).astype(np.float32)

In [295]:
# Gathering some data insights

print("Runtime:\n",
      "Runtime Mean:\n", mov_metadata['runtime'].mean(), '\n',
      "Runtime Standard Deviation:\n", mov_metadata['runtime'].std(), '\n',
      "Runtime Maximum:\n", mov_metadata['runtime'].max(), '\n',
      "Runtime Minimum:\n", mov_metadata['runtime'].min(), '\n')

print("Budget:\n",
      "Budget Mean:\n", mov_metadata['budget'].mean(), '\n',
      "Budget Standard Deviation:\n", mov_metadata['budget'].std(), '\n',
      "Budget Maximum:\n", mov_metadata['budget'].max(), '\n',
      "Budget Minimum:\n", mov_metadata['budget'].min(), '\n')

print("Revenue:\n",
      "Revenue Mean:\n", mov_metadata['revenue'].mean(), '\n',
      "Revenue Standard Deviation:\n", mov_metadata['revenue'].std(), '\n',
      "Revenue Maximum:\n", mov_metadata['revenue'].max(), '\n',
      "Revenue Minimum:\n", mov_metadata['revenue'].min(), '\n')

print("Popularity:\n",
      "Popularity Mean:\n", mov_metadata['popularity'].mean(), '\n',
      "Popularity Standard Deviation:\n", mov_metadata['popularity'].std(), '\n',
      "Popularity Maximum:\n", mov_metadata['popularity'].max(), '\n',
      "Popularity Minimum:\n", mov_metadata['popularity'].min(), '\n')

print("Vote Average:\n",
      "Vote Average Mean:\n", mov_metadata['vote_average'].mean(), '\n',
      "Vote Average Standard Deviation:\n", mov_metadata['vote_average'].std(), '\n',
      "Vote Average Maximum:\n", mov_metadata['vote_average'].max(), '\n',
      "Vote Average Minimum:\n", mov_metadata['vote_average'].min(), '\n')

print("Vote Count:\n",
      "Vote Count Mean:\n", mov_metadata['vote_count'].mean(), '\n',
      "Vote Count Standard Deviation:\n", mov_metadata['vote_count'].std(), '\n',
      "Vote Count Maximum:\n", mov_metadata['vote_count'].max(), '\n',
      "Vote Count Minimum:\n", mov_metadata['vote_count'].min(), '\n')

print("Budget Per Minute of Runtime:\n",
      "Budget Per Minute of Runtime:\n", mov_metadata['budget per minute of runtime'].mean(), '\n',
      "Budget Per Minute of Runtime Standard Deviation:\n", mov_metadata['budget per minute of runtime'].std(), '\n',
      "Budget Per Minute of Runtime Maximum:\n", mov_metadata['budget per minute of runtime'].max(), '\n',
      "Budget Per Minute of Runtime Minimum:\n", mov_metadata['budget per minute of runtime'].min(), '\n')

print("Budget Per User Vote:\n",
      "Budget Per User Vote Mean:\n", mov_metadata['budget per user vote'].mean(), '\n',
      "Budget Per User Vote Standard Deviation:\n", mov_metadata['budget per user vote'].std(), '\n',
      "Budget Per User Vote Maximum:\n", mov_metadata['budget per user vote'].max(), '\n',
      "Budget Per User Vote Minimum:\n", mov_metadata['budget per user vote'].min(), '\n')

print("Revenue Per Minute of Runtime:\n",
      "Revenue Per Minute of Runtime Mean:\n", mov_metadata['revenue per minute of runtime'].mean(), '\n',
      "Revenue Per Minute of Runtime Standard Deviation:\n", mov_metadata['revenue per minute of runtime'].std(), '\n',
      "Revenue Per Minute of Runtime Maximum:\n", mov_metadata['revenue per minute of runtime'].max(), '\n',
      "Revenue Per Minute of Runtime Minimum:\n", mov_metadata['revenue per minute of runtime'].min(), '\n')

print("Revenue Per User Vote:\n",
      "Revenue Per User Vote Mean:\n", mov_metadata['revenue per user vote'].mean(), '\n',
      "Revenue Per User Vote Standard Deviation:\n", mov_metadata['revenue per user vote'].std(), '\n',
      "Revenue Per User Vote Maximum:\n", mov_metadata['revenue per user vote'].max(), '\n',
      "Revenue Per User Vote Minimum:\n", mov_metadata['revenue per user vote'].min(), '\n')

print("Profit Per Minute of Runtime:\n",
      "Profit Per Minute of Runtime Mean:\n", mov_metadata['profit per minute of runtime'].mean(), '\n',
      "Profit Per Minute of Runtime Standard Deviation:\n", mov_metadata['profit per minute of runtime'].std(), '\n',
      "Profit Per Minute of Runtime Maximum:\n", mov_metadata['profit per minute of runtime'].max(), '\n',
      "Profit Per Minute of Runtime Minimum:\n", mov_metadata['profit per minute of runtime'].min(), '\n')

print("Profit Per User Vote:\n",
      "Profit Per User Vote Mean:\n", mov_metadata['profit per user vote'].mean(), '\n',
      "Profit Per User Vote Standard Deviation:\n", mov_metadata['profit per user vote'].std(), '\n',
      "Profit Per User Vote Maximum:\n", mov_metadata['profit per user vote'].max(), '\n',
      "Profit Per User Vote Minimum:\n", mov_metadata['profit per user vote'].min(), '\n')

Runtime:
 Runtime Mean:
 97.53248254496694 
 Runtime Standard Deviation:
 34.703852558828466 
 Runtime Maximum:
 1256.0 
 Runtime Minimum:
 1.0 

Budget:
 Budget Mean:
 21661888.0 
 Budget Standard Deviation:
 34346320.0 
 Budget Maximum:
 380000000.0 
 Budget Minimum:
 1.0 

Revenue:
 Revenue Mean:
 68899032.0 
 Revenue Standard Deviation:
 146524368.0 
 Revenue Maximum:
 2787965184.0 
 Revenue Minimum:
 1.0 

Popularity:
 Popularity Mean:
 2.9394726753234863 
 Popularity Standard Deviation:
 6.024213790893555 
 Popularity Maximum:
 547.48828125 
 Popularity Minimum:
 0.0 

Vote Average:
 Vote Average Mean:
 5.623877777777487 
 Vote Average Standard Deviation:
 1.915971028966491 
 Vote Average Maximum:
 10.0 
 Vote Average Minimum:
 0.0 

Vote Count:
 Vote Count Mean:
 110.9212875366211 
 Vote Count Standard Deviation:
 493.7283630371094 
 Vote Count Maximum:
 14075.0 
 Vote Count Minimum:
 0.0 

Budget Per Minute of Runtime:
 Budget Per Minute of Runtime:
 196841.6875 
 Budget Per Mi
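
The same mean/std/max/min summary can be produced more compactly with `agg`. A minimal sketch on a made-up frame (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical values; NaN entries are skipped by the aggregations
toy = pd.DataFrame({
    'runtime': [81.0, 104.0, 101.0, None],
    'vote_average': [7.7, 6.9, 6.5, 6.1],
})

# One row per column, one column per statistic
summary = toy[['runtime', 'vote_average']].agg(['mean', 'std', 'max', 'min']).T
print(summary)
```

Passing the full list of numeric columns in place of the two used here would cover every block printed above in one table.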

In [296]:
'''
The summary statistics above give several useful insights:

Runtime averages about 98min with an std of about 35min, so most movies fall
somewhere around 60min to 135min - this is about what I'd expect

Budget averages around 21.7 million (among movies with a known budget) but the
std of 34.3 million is larger than the mean, so the mean alone is a poor
summary - I expected a wide variety of budgets with such an expansive dataset

Revenue averages around 69 million with a much greater standard deviation
(146.5 million) than budget - I think this indicates how wildly the revenue of
a movie can vary regardless of budget

Vote average has our average user giving the average movie a 5.6/10 with an 
std of 1.9 - so we're seeing scores like 3.7-7.5 often

Vote count has an average of 111 with an std of 494 - another key point here
is the maximum of 14075 - I think this indicates a disproportion, with 
blockbusters receiving huge numbers of votes and the rest of cinema 
receiving a much smaller share of attention

The engineered features also have high std values -
we'll save these features for when we need them
'''
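
The "typical range" figures in these notes are just mean minus and plus one std. A quick sketch of that arithmetic using the rounded numbers printed earlier (the `stats` and `bands` names are only for illustration):

```python
# (mean, std) pairs rounded from the printed summary output
stats = {
    'runtime': (97.53, 34.70),
    'vote_average': (5.62, 1.92),
    'vote_count': (110.92, 493.73),
}

# One-std band around the mean for each column
bands = {
    name: (round(mean - std, 2), round(mean + std, 2))
    for name, (mean, std) in stats.items()
}

for name, (low, high) in bands.items():
    print(f"{name}: {low} to {high}")
```

The band for vote_count dips below zero, which is impossible for a count. That is a sign the distribution is heavily right-skewed, so a median or percentiles would probably describe it better than mean and std.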

In [297]:
fig = px.scatter(mov_metadata, 
                 x='vote_average', 
                 y='revenue', 
                 size='vote_count', 
                 hover_data=['title'],
                 labels={
                     'vote_average':'Average User Score',
                     'revenue':'Revenue',},
                 title='User Score vs Revenue',)

fig.show()

In [298]:
fig = px.scatter(mov_metadata, 
                 x='vote_count', 
                 y='revenue',
                 color='vote_average',  
                 hover_data=['title'],
                 labels={
                     'vote_count':'Number of User Votes',
                     'revenue':'Revenue',
                     'vote_average':'Average User Score'},
                 title='Number of User Votes vs Revenue')

fig.show()

In [299]:
fig = px.scatter(mov_metadata, 
                 x='vote_count', 
                 y='vote_average',
                 title='Number of User Votes vs Average Score',
                 labels={
                     'vote_count':'Number of User Votes',
                     'vote_average':'Average Score',},
                 hover_data=['title'])

fig.show()
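
To put a number on what the scatter plots show, one option is a rank correlation, which is less sensitive to the handful of blockbuster outliers than a plain Pearson correlation. This is only a sketch on made-up values (the `toy` frame is hypothetical); correlating the ranks gives a Spearman-style measure:

```python
import numpy as np
import pandas as pd

# Hypothetical values; one row has no revenue, like many rows in the dataset
toy = pd.DataFrame({
    'vote_count': [10, 200, 3000, 50, 12000],
    'vote_average': [5.0, 7.9, 7.8, 4.2, 8.1],
    'revenue': [1e5, 5e6, 3e8, np.nan, 1.5e9],
})

# Drop rows without revenue, then correlate the ranks of the remaining rows
complete = toy.dropna(subset=['revenue'])
rank_corr = complete.rank().corr()
print(rank_corr)
```

On the real data, the same two lines applied to the relevant columns of mov_metadata would give a rough sense of how strongly vote count and vote average each track revenue.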