# What Makes Movies Earn More Money?
#### Stephanie Qie & Nikhil Pateel

## Introduction

Movies play an important role in today's popular culture and media, as they are one of the largest forms of entertainment and one of the most profitable industries in our society.

From large movie studios to movie analysts, many people have tried to analyze what makes a movie earn more money.

In this tutorial, we will ....


## 0.0 - Required Libraries

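We need the following libraries:
+ `pandas` - For storing and manipulating data
+ `matplotlib` and `seaborn` - For plotting data
+ `scikit-learn` - For fitting and validating models (we import its `linear_model` and `model_selection` modules)
+ `statsmodels` - For statistical modeling
+ `requests` - For making HTTP requests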

In [43]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn import model_selection
from statsmodels import api as sm
import requests

## 1.0 - Dataset Source

Our dataset comes from the following source:
+ [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata)

## 1.1 - Grabbing Data

Load the .csv files that we downloaded from Kaggle, and display the first few rows of each to get a general understanding of the data.

In [44]:
df = pd.read_csv('tmdb_5000_movies.csv')
cast = pd.read_csv('tmdb_5000_credits.csv')

display(df.head())
display(cast.head())

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## 1.2 - Tidying Data

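We join the two tables on movie title so that all of the data is in one table. The join matches `original_title` from the movies table against `title` from the credits table, so a movie whose two titles differ (for example `$upercapitalist`, whose `title` is `Supercapitalist`) ends up with missing `movie_id`, `cast`, and `crew` values.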

In [52]:
df = df.set_index('original_title').join(cast.set_index('title'))
display(df.head())

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,overview,popularity,production_companies,production_countries,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
#Horror,1500000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 9648, ""na...",http://www.hashtaghorror.com/,301325,[],de,"Inspired by actual events, a group of 12 year ...",2.815228,"[{""name"": ""AST Studios"", ""id"": 75277}, {""name""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,90.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Death is trending.,#Horror,3.3,52,301325.0,"[{""cast_id"": 0, ""character"": ""Alex's 12-Step F...","[{""credit_id"": ""545bbac70e0a261fb6002329"", ""de..."
$upercapitalist,0,"[{""id"": 53, ""name"": ""Thriller""}]",http://supercapitalist.net/,119458,[],en,A maverick New York hedge fund trader with unc...,0.174311,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,103.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Money for Life,Supercapitalist,3.5,2,,,
(500) Days of Summer,7500000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://500days.com,19913,"[{""id"": 248, ""name"": ""date""}, {""id"": 572, ""nam...",en,"Tom (Joseph Gordon-Levitt), greeting-card writ...",45.610993,"[{""name"": ""Fox Searchlight Pictures"", ""id"": 43...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,95.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,It was almost like falling in love.,(500) Days of Summer,7.2,2904,19913.0,"[{""cast_id"": 4, ""character"": ""Tom Hansen"", ""cr...","[{""credit_id"": ""52fe47f99251416c750abaa5"", ""de..."
...E tu vivrai nel terrore! L'aldilà,0,"[{""id"": 27, ""name"": ""Horror""}]",,19204,"[{""id"": 612, ""name"": ""hotel""}, {""id"": 1706, ""n...",it,A young woman inherits an old hotel in Louisia...,8.022122,"[{""name"": ""Fulvia Film"", ""id"": 13682}]","[{""iso_3166_1"": ""IT"", ""name"": ""Italy""}]",...,87.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,The seven dreaded gateways to Hell are conceal...,The Beyond,6.6,117,,,
10 Cloverfield Lane,15000000,"[{""id"": 53, ""name"": ""Thriller""}, {""id"": 878, ""...",http://www.10cloverfieldlane.com/,333371,"[{""id"": 1930, ""name"": ""kidnapping""}, {""id"": 23...",en,"After a car accident, Michelle awakens to find...",53.698683,"[{""name"": ""Paramount Pictures"", ""id"": 4}, {""na...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",...,103.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Monsters come in many forms.,10 Cloverfield Lane,6.8,2468,333371.0,"[{""cast_id"": 2, ""character"": ""Michelle"", ""cred...","[{""credit_id"": ""57627624c3a3680682000872"", ""de..."
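
The missing `movie_id`, `cast`, and `crew` values in rows such as `$upercapitalist` come from matching on title text. As a hedged alternative, the sketch below merges on the numeric identifiers instead (`id` in the movies table, `movie_id` in the credits table). The two small frames are made-up stand-ins for the real CSV files, and the credits row for id 119458 is an assumption for illustration; the names `movies`, `credits`, `by_title`, and `merged` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv.
# The credits row for id 119458 is assumed for illustration.
movies = pd.DataFrame({
    "id": [19995, 119458],
    "original_title": ["Avatar", "$upercapitalist"],
    "title": ["Avatar", "Supercapitalist"],
})
credits = pd.DataFrame({
    "movie_id": [19995, 119458],
    "title": ["Avatar", "Supercapitalist"],
    "cast": ["[]", "[]"],
})

# Joining on title text misses the second movie, because its
# original_title and title are spelled differently.
by_title = movies.set_index("original_title").join(
    credits.set_index("title")[["cast"]]
)

# Merging on the numeric ids matches both rows. validate="one_to_one"
# raises an error if either key column contains duplicates.
merged = movies.merge(
    credits.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    how="left",
    validate="one_to_one",
)

print(by_title["cast"].isna().sum())  # rows left without credits
print(merged["cast"].isna().sum())
```

Merging on ids also avoids accidental matches between different movies that share a title. The rest of this tutorial keeps the title-based join shown above.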


As seen in the table above, some of the columns (i.e. genres, production companies, production countries, spoken languages, cast, and crew) have more than one element per cell. We will convert those cells into Python data structures to make the data in them easier to access.

First we will modify the genres. As shown below, a movie can have multiple genres, and each genre has an id and a name. For each movie, we will store its genres in a list, where each genre is a dictionary with the keys `id` and `name`.

In [46]:
df.iloc[0]["genres"]

'[{"id": 18, "name": "Drama"}, {"id": 9648, "name": "Mystery"}, {"id": 27, "name": "Horror"}, {"id": 53, "name": "Thriller"}]'

In [76]:
import json

# Each genres cell holds a JSON array stored as a string, so json.loads
# can parse it directly. This keeps every genre in the cell and handles
# multi-word names such as "Science Fiction".
def parse_genres(cell):
    # Leave cells that are already parsed (or missing) untouched so that
    # this notebook cell can be re-run safely.
    if not isinstance(cell, str):
        return cell
    # Result: a list of dictionaries with the keys "id" and "name".
    return json.loads(cell)

# Assign to the column itself; changing `row` inside df.iterrows()
# would only change a copy of the row, not df.
df["genres"] = df["genres"].apply(parse_genres)

df.iloc[0]["genres"]

[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'name': 'Mystery'}, {'id': 27, 'name': 'Horror'}, {'id': 53, 'name': 'Thriller'}]

Next we will modify the production companies. As shown below, a movie can have multiple production companies, and each production company has a name and id.

In [47]:
df.iloc[0]["production_companies"]

'[{"name": "AST Studios", "id": 75277}, {"name": "Lowland Pictures", "id": 75278}]'
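
Since the cell is a JSON array stored as a string, one possible way to convert it is with `json.loads`. The minimal sketch below uses the cell value copied from the output above; the variable names `raw`, `companies`, and `company_names` are illustrative.

```python
import json

# Cell value copied from the output above.
raw = '[{"name": "AST Studios", "id": 75277}, {"name": "Lowland Pictures", "id": 75278}]'

# Parse the string into a list of dictionaries.
companies = json.loads(raw)

# Pull out just the company names for easier grouping later on.
company_names = [company["name"] for company in companies]

print(company_names)
```

On the full table, the same idea could be applied to the whole column with `df["production_companies"].apply(json.loads)`, assuming every cell is still an unparsed string.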

Next we will modify the production countries. As shown below, a movie can have multiple production countries, and each production country has an iso_3166_1 and a name.

In [48]:
df.iloc[0]["production_countries"]

'[{"iso_3166_1": "US", "name": "United States of America"}]'

Next we will modify the spoken languages. As shown below, a movie can have multiple spoken languages, and each spoken language has an iso_639_1 and a name.

In [49]:
df.iloc[0]["spoken_languages"]

'[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\\u00f1ol"}]'
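
The output above shows `Espa\\u00f1ol`, which means the cell contains the JSON escape sequence `\u00f1` rather than the character `ñ` itself. A JSON parser decodes such escapes while parsing. The sketch below illustrates this with the cell value from the output above, written as a raw string so that the backslash is kept literally; the variable names are illustrative.

```python
import json

# Raw string: the text contains a literal backslash followed by u00f1,
# just like the cell in the table.
raw = r'[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}]'

languages = json.loads(raw)

# json.loads turns the \u00f1 escape into the actual character.
language_names = [language["name"] for language in languages]
language_codes = [language["iso_639_1"] for language in languages]

print(language_names)
print(language_codes)
```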

Next we will modify the cast. As shown below, a movie has multiple cast members, and each member has a cast_id, a character, a credit_id, a gender, an id, a name, and an order.

In [50]:
df.iloc[0]["cast"]

'[{"cast_id": 0, "character": "Alex\'s 12-Step Friend", "credit_id": "545bba84c3a3685358001b80", "gender": 1, "id": 343, "name": "Taryn Manning", "order": 1}, {"cast_id": 1, "character": "Sam\'s Mom", "credit_id": "545bba8a0e0a261fad0023f6", "gender": 1, "id": 10871, "name": "Natasha Lyonne", "order": 2}, {"cast_id": 2, "character": "Alex Cox", "credit_id": "545bba8fc3a36853500018a4", "gender": 1, "id": 2838, "name": "Chlo\\u00eb Sevigny", "order": 3}, {"cast_id": 3, "character": "Mr. Cox", "credit_id": "545bba94c3a3685353001a56", "gender": 2, "id": 9296, "name": "Balthazar Getty", "order": 4}, {"cast_id": 4, "character": "Dr. White", "credit_id": "545bba990e0a261fb900220b", "gender": 2, "id": 16327, "name": "Timothy Hutton", "order": 5}, {"cast_id": 5, "character": "Lisa", "credit_id": "545bba9ec3a368535d001e67", "gender": 1, "id": 210573, "name": "Lydia Hearst", "order": 6}, {"cast_id": 6, "character": "Mom", "credit_id": "545bbaa4c3a368535d001e6b", "gender": 0, "id": 180425, "name":
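
Once parsed, the `order` field can be used to pick out the top-billed cast members. The sketch below is one way this might look, using the first three entries from the output above. The helper name `top_billed` and its parameter `n` are hypothetical. Because the title-based join leaves some rows without a `cast` value, the helper returns an empty list for anything that is not a string.

```python
import json

def top_billed(cell, n=3):
    """Return the names of the first n cast members, sorted by "order".

    Cells that are not strings (for example NaN from an unmatched join)
    yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    members = sorted(json.loads(cell), key=lambda member: member["order"])
    return [member["name"] for member in members[:n]]

# First three cast entries from the output above.
raw_cast = r"""[
 {"cast_id": 0, "character": "Alex's 12-Step Friend", "credit_id": "545bba84c3a3685358001b80", "gender": 1, "id": 343, "name": "Taryn Manning", "order": 1},
 {"cast_id": 1, "character": "Sam's Mom", "credit_id": "545bba8a0e0a261fad0023f6", "gender": 1, "id": 10871, "name": "Natasha Lyonne", "order": 2},
 {"cast_id": 2, "character": "Alex Cox", "credit_id": "545bba8fc3a36853500018a4", "gender": 1, "id": 2838, "name": "Chlo\u00eb Sevigny", "order": 3}
]"""

print(top_billed(raw_cast, n=2))
print(top_billed(float("nan")))  # missing cast value
```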

Next we will modify the crew. As shown below, each movie's crew has multiple members, and each member has a credit_id, a department, a gender, an id, a job, and a name.

In [51]:
df.iloc[0]["crew"]

'[{"credit_id": "545bbac70e0a261fb6002329", "department": "Writing", "gender": 1, "id": 61111, "job": "Screenplay", "name": "Tara Subkoff"}, {"credit_id": "545bbabf0e0a261fb9002212", "department": "Directing", "gender": 1, "id": 61111, "job": "Director", "name": "Tara Subkoff"}, {"credit_id": "545bbae4c3a36853500018a8", "department": "Production", "gender": 1, "id": 61111, "job": "Producer", "name": "Tara Subkoff"}, {"credit_id": "545bbad3c3a3685358001b92", "department": "Production", "gender": 0, "id": 1382445, "job": "Producer", "name": "Jason Ludman"}, {"credit_id": "545bbadbc3a368535d001e74", "department": "Production", "gender": 0, "id": 1382446, "job": "Producer", "name": "Oren Segal"}, {"credit_id": "545bbaf3c3a3685358001b9d", "department": "Production", "gender": 0, "id": 1382448, "job": "Producer", "name": "Brendan Walsh"}]'
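
The `job` field makes it possible to look up specific roles, such as the director, after parsing. The following is a minimal sketch using four of the crew entries from the output above; the helper name `people_with_job` is hypothetical, and non-string cells (such as NaN from the title-based join) are treated as having no crew.

```python
import json

def people_with_job(cell, job):
    """Return the names of all crew members whose "job" equals job."""
    if not isinstance(cell, str):
        return []
    return [member["name"] for member in json.loads(cell) if member["job"] == job]

# Four crew entries from the output above.
raw_crew = (
    '[{"credit_id": "545bbac70e0a261fb6002329", "department": "Writing", "gender": 1, "id": 61111, "job": "Screenplay", "name": "Tara Subkoff"}, '
    '{"credit_id": "545bbabf0e0a261fb9002212", "department": "Directing", "gender": 1, "id": 61111, "job": "Director", "name": "Tara Subkoff"}, '
    '{"credit_id": "545bbae4c3a36853500018a8", "department": "Production", "gender": 1, "id": 61111, "job": "Producer", "name": "Tara Subkoff"}, '
    '{"credit_id": "545bbad3c3a3685358001b92", "department": "Production", "gender": 0, "id": 1382445, "job": "Producer", "name": "Jason Ludman"}]'
)

print(people_with_job(raw_crew, "Director"))
print(people_with_job(raw_crew, "Producer"))
```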

## 2.0 - Exploratory Data Analysis

## 3.0 - Analysis

## Insights