# Microsoft Movie Studios

Author: Mario Mocombe

**Overview**

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Microsoft sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.

Questions to consider:

* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?

## Data Understanding

Note that this data may not reflect the most up-to-date box office information.


1) im.db.zip 

    A zipped SQLite database containing movie data from the Internet Movie Database (IMDb). The most relevant tables are movie_basics and movie_ratings.

2) bom.movie_gross.csv.gz

    A compressed CSV file containing box office data from the website Box Office Mojo.

3) tn.movie_budgets.csv.gz

    A compressed CSV file containing production budget and gross data from the website The Numbers.

Questions to consider:

* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?


In [147]:
##Import Standard Packages
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# 1 IMDB DATABASE

In [148]:
### making a connection with the IMDB DATABASE using SQLite3
conn = sqlite3.connect('zippedData/im.db')
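
The data list above names the archive `im.db.zip`, while the connection opens `zippedData/im.db`, so the archive has to be extracted first. This matters because `sqlite3.connect` creates a new, empty database when the file does not exist, instead of raising an error. A minimal sketch of the extraction step, assuming the archive sits in `zippedData/` (the helper name `extract_database` is hypothetical):

```python
import zipfile
from pathlib import Path


def extract_database(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


# Assumed location of the archive; nothing happens if it is not there.
archive_path = Path("zippedData/im.db.zip")
if archive_path.exists():
    extract_database(archive_path, "zippedData")
```

Checking `Path("zippedData/im.db").exists()` before connecting is a cheap way to avoid silently working with an empty database.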

In [149]:
### Create a cursor to move through the database.
### A cursor is the object that executes SQL commands; create it by calling .cursor() on the connection.

cur = conn.cursor()
# sqlite_master is SQLite's built-in catalog table; this query lists the table names.
cur.execute("""SELECT name FROM sqlite_master WHERE type = 'table';""")

<sqlite3.Cursor at 0x1ec2a445570>

In [150]:
## Use the fetchall method to find out the table names
## Fetch the result and store it in table_names
table_names = cur.fetchall()
table_names

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]
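
Knowing the table names is only half of the picture; the column names of each table are needed before writing joins. A small sketch of how both could be listed, using SQLite's `PRAGMA table_info`. The helper names and the in-memory demo table are assumptions for illustration, not part of the original notebook:

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


def list_columns(conn, table):
    """Return the column names of `table`."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}");').fetchall()
    return [row[1] for row in rows]


# In-memory stand-in for im.db so the sketch runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)
print(list_tables(demo))
print(list_columns(demo, "movie_ratings"))
```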

In [151]:
pd.read_sql("SELECT * FROM movie_basics;", conn)

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


In [152]:
pd.read_sql("SELECT * FROM movie_ratings ORDER BY movie_id;", conn)

Unnamed: 0,movie_id,averagerating,numvotes
0,tt0063540,7.0,77
1,tt0066787,7.2,43
2,tt0069049,6.9,4517
3,tt0069204,6.1,13
4,tt0100275,6.5,119
...,...,...,...
73851,tt9913084,6.2,6
73852,tt9914286,8.7,136
73853,tt9914642,8.5,8
73854,tt9914942,6.6,5


In [153]:
#########KEEP#########################

s = """
SELECT primary_title, start_year, runtime_minutes, genres, averagerating, numvotes 
FROM movie_basics
JOIN movie_ratings
USING(movie_id)
ORDER BY numvotes DESC;
"""
imdb = pd.read_sql(s, conn)
#######################################

In [154]:
imdb

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066
1,The Dark Knight Rises,2012,164.0,"Action,Thriller",8.4,1387769
2,Interstellar,2014,169.0,"Adventure,Drama,Sci-Fi",8.6,1299334
3,Django Unchained,2012,165.0,"Drama,Western",8.4,1211405
4,The Avengers,2012,143.0,"Action,Adventure,Sci-Fi",8.1,1183655
...,...,...,...,...,...,...
73851,Columbus,2018,85.0,Comedy,5.8,5
73852,BADMEN with a good behavior,2018,87.0,"Comedy,Horror",9.2,5
73853,July Kaatril,2019,,Romance,9.0,5
73854,Swarm Season,2019,86.0,Documentary,6.2,5


In [146]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   primary_title    73856 non-null  object 
 1   start_year       73856 non-null  int64  
 2   runtime_minutes  66236 non-null  float64
 3   genres           73052 non-null  object 
 4   averagerating    73856 non-null  float64
 5   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 3.4+ MB


In [123]:
imdb.dropna(inplace=True)

In [124]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65720 entries, 0 to 73854
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   primary_title    65720 non-null  object 
 1   start_year       65720 non-null  int64  
 2   runtime_minutes  65720 non-null  float64
 3   genres           65720 non-null  object 
 4   averagerating    65720 non-null  float64
 5   numvotes         65720 non-null  int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 3.5+ MB


In [126]:
imdb.head(60)

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066
1,The Dark Knight Rises,2012,164.0,"Action,Thriller",8.4,1387769
2,Interstellar,2014,169.0,"Adventure,Drama,Sci-Fi",8.6,1299334
3,Django Unchained,2012,165.0,"Drama,Western",8.4,1211405
4,The Avengers,2012,143.0,"Action,Adventure,Sci-Fi",8.1,1183655
5,The Wolf of Wall Street,2013,180.0,"Biography,Crime,Drama",8.2,1035358
6,Shutter Island,2010,138.0,"Mystery,Thriller",8.1,1005960
7,Guardians of the Galaxy,2014,121.0,"Action,Adventure,Comedy",8.1,948394
8,Deadpool,2016,108.0,"Action,Adventure,Comedy",8.0,820847
9,The Hunger Games,2012,142.0,"Action,Adventure,Sci-Fi",7.2,795227


In [127]:
imdb.describe()

Unnamed: 0,start_year,runtime_minutes,averagerating,numvotes
count,65720.0,65720.0,65720.0,65720.0
mean,2014.258065,94.732273,6.320902,3954.674
std,2.600143,209.377017,1.458878,32088.23
min,2010.0,3.0,1.0,5.0
25%,2012.0,81.0,5.5,16.0
50%,2014.0,91.0,6.5,62.0
75%,2016.0,104.0,7.3,352.0
max,2019.0,51420.0,10.0,1841066.0


## Data Preparation
Describe and justify the process for preparing the data for analysis.

Questions to consider:

* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
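
On the outlier question: the `imdb.describe()` output above reports a maximum `runtime_minutes` of 51420 and a minimum of 3, while the 75th percentile is 104. One possible way to handle this is to keep only runtimes inside a plausible range. The bounds of 40 and 300 minutes below are assumptions chosen for illustration, not values taken from the data:

```python
import pandas as pd


def drop_runtime_outliers(df, low=40, high=300):
    """Keep rows whose runtime_minutes lies in [low, high] (inclusive)."""
    mask = df["runtime_minutes"].between(low, high)
    return df.loc[mask].copy()


# Small stand-in frame; the runtimes 3, 91 and 51420 echo the describe() output.
demo = pd.DataFrame(
    {
        "primary_title": ["Short clip", "Typical feature", "Data-entry error"],
        "runtime_minutes": [3.0, 91.0, 51420.0],
    }
)
cleaned = drop_runtime_outliers(demo)
print(cleaned)
```

Rows with a missing runtime are also dropped by this filter, because `between` evaluates to `False` for NaN.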

In [128]:
#Add > 25000 votes as a condition?

imdb.tail(64000)

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes
1720,Fading Gigolo,2013,90.0,Comedy,6.2,22473
1721,The Guernsey Literary and Potato Peel Pie Society,2018,124.0,"Drama,History,Romance",7.4,22443
1722,The Silence,2019,90.0,"Horror,Thriller",5.2,22399
1723,Better Watch Out,2016,89.0,"Comedy,Crime,Horror",6.5,22367
1724,Paranormal Activity: The Ghost Dimension,2015,88.0,"Horror,Mystery,Thriller",4.6,22361
...,...,...,...,...,...,...
73848,The Winter Garden's Tale,2018,75.0,"Documentary,Drama",7.6,5
73850,The Projectionist,2019,81.0,Documentary,7.0,5
73851,Columbus,2018,85.0,Comedy,5.8,5
73852,BADMEN with a good behavior,2018,87.0,"Comedy,Horror",9.2,5
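
The comment in the cell above asks whether to require more than 25000 votes. The case for doing so is visible in the data: the median `numvotes` is 62, and a title with only 5 votes can carry a 9.2 rating. A minimal sketch of that filter, with the threshold kept as a tunable assumption:

```python
import pandas as pd

MIN_VOTES = 25_000  # threshold floated in the notebook comments; an assumption to tune


def filter_by_votes(df, min_votes=MIN_VOTES):
    """Keep only rows with more than `min_votes` IMDb votes."""
    return df.loc[df["numvotes"] > min_votes].reset_index(drop=True)


# Three rows copied from the outputs above.
demo = pd.DataFrame(
    {
        "primary_title": ["Inception", "Fading Gigolo", "Columbus"],
        "averagerating": [8.8, 6.2, 5.8],
        "numvotes": [1841066, 22473, 5],
    }
)
popular = filter_by_votes(demo)
print(popular)
```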


In [157]:
imdb.tail(64000)

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes
9856,The Source Family,2012,98.0,"Documentary,Music",6.9,951
9857,Kill or Be Killed,2015,103.0,"Horror,Mystery,Thriller",4.4,951
9858,Michael Inside,2017,96.0,Drama,7.2,951
9859,The Little House,2014,136.0,"Drama,Romance",7.3,950
9860,Spa Night,2016,93.0,Drama,5.9,950
...,...,...,...,...,...,...
73851,Columbus,2018,85.0,Comedy,5.8,5
73852,BADMEN with a good behavior,2018,87.0,"Comedy,Horror",9.2,5
73853,July Kaatril,2019,,Romance,9.0,5
73854,Swarm Season,2019,86.0,Documentary,6.2,5


In [156]:
imdb.value_counts()

primary_title                              start_year  runtime_minutes  genres                averagerating  numvotes
Šiška Deluxe                               2015        108.0            Comedy,Drama          6.3            384         1
Goodbye to All That                        2014        87.0             Comedy,Drama,Romance  5.2            2141        1
Grace                                      2014        95.0             Drama                 6.0            176         1
                                           2011        98.0             Crime,Drama,Horror    6.5            19          1
Grabbers                                   2012        94.0             Comedy,Horror,Sci-Fi  6.3            15727       1
                                                                                                                        ..
Revelation: Dawn of Global Government      2016        106.0            Documentary           6.8            65          1
Revelation Trail     

In [130]:
imdb.duplicated().value_counts()

False    65720
dtype: int64

In [231]:
####SELECT FROM WHERE
    ###Genre LIKE '%Action%'
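
The `SELECT ... WHERE genres LIKE '%Action%'` idea noted above could be written as a parameterized query. This is a sketch only: the in-memory tables and the `m1`/`m2` ids are stand-ins so the example runs without `im.db`, and `movies_in_genre` is a hypothetical helper name.

```python
import sqlite3

import pandas as pd

GENRE_QUERY = """
SELECT primary_title, genres, averagerating, numvotes
FROM movie_basics
JOIN movie_ratings USING (movie_id)
WHERE genres LIKE ?
ORDER BY numvotes DESC;
"""


def movies_in_genre(conn, genre):
    """Return rated movies whose genres string contains `genre`."""
    return pd.read_sql(GENRE_QUERY, conn, params=(f"%{genre}%",))


# Tiny in-memory stand-in for im.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('m1', 'Inception', 'Action,Adventure,Sci-Fi'),
        ('m2', 'Django Unchained', 'Drama,Western');
    INSERT INTO movie_ratings VALUES ('m1', 8.8, 1841066), ('m2', 8.4, 1211405);
    """
)
action_movies = movies_in_genre(demo_conn, "Action")
print(action_movies)
```

Because `genres` is a comma-separated string, `LIKE` performs a substring match, so a film tagged with several genres is returned for each of them.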

In [241]:
###DEL combined_df = pd.concat([x, y], axis=1, join='inner')

In [243]:
###  combined_df

# ################################################################

In [131]:
## TEST: actors and actresses of the most-voted films
## (the query sorts by numvotes; it does not use any gross figures)

q = """
SELECT primary_title, runtime_minutes, genres, category, primary_name, averagerating, numvotes
FROM principals
JOIN movie_ratings
USING (movie_id)
JOIN movie_basics
USING (movie_id)
JOIN persons
USING (person_id)
WHERE category IN ('actor', 'actress')
ORDER BY numvotes DESC;
"""
imdb2 = pd.read_sql(q, conn)
#######################################

In [None]:
#### add where num votes > 25000

In [132]:
imdb2.head(60)

Unnamed: 0,primary_title,runtime_minutes,genres,category,primary_name,averagerating,numvotes
0,Inception,148.0,"Action,Adventure,Sci-Fi",actor,Leonardo DiCaprio,8.8,1841066
1,Inception,148.0,"Action,Adventure,Sci-Fi",actor,Joseph Gordon-Levitt,8.8,1841066
2,Inception,148.0,"Action,Adventure,Sci-Fi",actress,Ellen Page,8.8,1841066
3,Inception,148.0,"Action,Adventure,Sci-Fi",actor,Ken Watanabe,8.8,1841066
4,The Dark Knight Rises,164.0,"Action,Thriller",actor,Christian Bale,8.4,1387769
5,The Dark Knight Rises,164.0,"Action,Thriller",actor,Tom Hardy,8.4,1387769
6,The Dark Knight Rises,164.0,"Action,Thriller",actress,Anne Hathaway,8.4,1387769
7,The Dark Knight Rises,164.0,"Action,Thriller",actor,Gary Oldman,8.4,1387769
8,Interstellar,169.0,"Adventure,Drama,Sci-Fi",actor,Matthew McConaughey,8.6,1299334
9,Interstellar,169.0,"Adventure,Drama,Sci-Fi",actress,Anne Hathaway,8.6,1299334


In [133]:
imdb2

Unnamed: 0,primary_title,runtime_minutes,genres,category,primary_name,averagerating,numvotes
0,Inception,148.0,"Action,Adventure,Sci-Fi",actor,Leonardo DiCaprio,8.8,1841066
1,Inception,148.0,"Action,Adventure,Sci-Fi",actor,Joseph Gordon-Levitt,8.8,1841066
2,Inception,148.0,"Action,Adventure,Sci-Fi",actress,Ellen Page,8.8,1841066
3,Inception,148.0,"Action,Adventure,Sci-Fi",actor,Ken Watanabe,8.8,1841066
4,The Dark Knight Rises,164.0,"Action,Thriller",actor,Christian Bale,8.4,1387769
...,...,...,...,...,...,...,...
248295,Casting Chloe,,"Comedy,Drama",actress,Sarah Jones Dittmeier,8.2,5
248296,Casting Chloe,,"Comedy,Drama",actress,Annabelle Fox,8.2,5
248297,American Dope: Acid Dreams,53.0,Documentary,actor,Alan Bradley,8.2,5
248298,American Dope: Acid Dreams,53.0,Documentary,actor,Michael Levine,8.2,5


In [134]:
####TEST####
imdb2['primary_name'].value_counts()

Eric Roberts               122
Brahmanandam                74
Prakash Raj                 74
Tom Sizemore                61
Michael Madsen              59
                          ... 
Feristah Senem Yildirim      1
Eduard Alexandre             1
Jiameng Yu                   1
Renay Allen                  1
Justin Meeks                 1
Name: primary_name, Length: 142527, dtype: int64
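
A plain `value_counts()` rewards the most prolific performers (Eric Roberts appears 122 times) rather than those attached to widely seen films. One alternative view, sketched here under the assumption that `numvotes` is an acceptable proxy for audience reach, aggregates votes and ratings per performer. The helper name `summarize_cast` and the `min_films` cutoff are hypothetical:

```python
import pandas as pd


def summarize_cast(df, min_films=2):
    """Aggregate films, votes and ratings per performer, sorted by total votes."""
    summary = df.groupby("primary_name").agg(
        films=("primary_title", "nunique"),  # distinct titles, not distinct movie ids
        total_votes=("numvotes", "sum"),
        mean_rating=("averagerating", "mean"),
    )
    keep = summary["films"] >= min_films
    return summary[keep].sort_values("total_votes", ascending=False)


# Rows copied from the imdb2 output above.
demo = pd.DataFrame(
    {
        "primary_title": ["Inception", "The Dark Knight Rises",
                          "The Dark Knight Rises", "Interstellar"],
        "primary_name": ["Leonardo DiCaprio", "Christian Bale",
                         "Anne Hathaway", "Anne Hathaway"],
        "averagerating": [8.8, 8.4, 8.4, 8.6],
        "numvotes": [1841066, 1387769, 1387769, 1299334],
    }
)
cast_summary = summarize_cast(demo)
print(cast_summary)
```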

In [135]:
####TEST####
imdb2['genres'].value_counts().head(20)

Drama                   48535
Comedy                  24363
Horror                  12110
Comedy,Drama            11008
Thriller                 6608
Drama,Romance            6301
Documentary              6247
Comedy,Romance           5339
Comedy,Drama,Romance     5056
Action                   4502
Horror,Thriller          4225
Drama,Thriller           4140
Romance                  3094
Comedy,Horror            2551
Action,Crime,Drama       2399
Crime,Drama,Thriller     2093
Crime,Drama              2079
Family                   2015
Drama,Family             1980
Action,Drama             1738
Name: genres, dtype: int64

In [None]:
conn.close()

# 2 BOX OFFICE MOJO

In [177]:
## Load the second dataset, Box Office Mojo, into a pandas DataFrame.
# Read the compressed CSV file and display the DataFrame
bom = pd.read_csv("zippedData/bom.movie_gross.csv.gz")
bom

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [178]:
bom.info()
### TODO: foreign_gross is stored as object (strings); convert it to a numeric dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [179]:
## bom['foreign_gross'].astype(float)
##Gives Error: could not convert string to float: '1,131.6'

In [180]:
## bom['foreign_gross'].astype(int)
##Gives Error: cannot convert float NaN to integer

In [181]:
### sorting the values by domestic gross, we see that the foreign gross is off for 3 of the top results and for Furious 7.
bom.sort_values(by=['domestic_gross'],ascending=False)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.0,1131.6,2015
3080,Black Panther,BV,700100000.0,646900000,2018
3079,Avengers: Infinity War,BV,678800000.0,1369.5,2018
1873,Jurassic World,Uni.,652300000.0,1019.4,2015
727,Marvel's The Avengers,BV,623400000.0,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [182]:
### sorting the values by foreign gross, we see that the top 5 results are wildly popular franchises with a little over $1k gross.
bom.sort_values(by=['foreign_gross'])

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
2760,The Fate of the Furious,Uni.,226000000.0,1010.0,2017
1873,Jurassic World,Uni.,652300000.0,1019.4,2015
1872,Star Wars: The Force Awakens,BV,936700000.0,1131.6,2015
1874,Furious 7,Uni.,353000000.0,1163.0,2015
3079,Avengers: Infinity War,BV,678800000.0,1369.5,2018
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [183]:
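## Replace the erroneous values with more realistic ones. The five entries appear to be recorded in millions (e.g. '1,131.6' for about $1.13 billion), unlike the rest of the column, which is in dollars.
## Note: the replacements below are truncated to whole millions, so '1,019.4', '1,131.6' and '1,369.5' each lose their fractional part.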
bom['foreign_gross'] = bom['foreign_gross'].replace(['1,010.0','1,019.4','1,131.6', '1,163.0','1,369.5'], ['1010000000', '1019000000', '1131000000', '1163000000', '1369000000'])
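
A more general alternative to listing the five strings by hand is to parse the whole column with a rule. The sketch below assumes that any value containing a comma is expressed in millions of dollars, which is an interpretation of the five rows above and not something the source file documents. It also handles both conversion errors seen earlier (the comma and the NaN values) by returning floats:

```python
import pandas as pd


def parse_foreign_gross(series):
    """Convert foreign_gross strings to float dollars; missing values stay NaN."""
    text = series.astype(str)
    # Assumption: values such as '1,131.6' are in millions of dollars.
    in_millions = text.str.contains(",", regex=False, na=False)
    values = pd.to_numeric(text.str.replace(",", "", regex=False), errors="coerce")
    scaled = (values * 1_000_000).round()
    return values.where(~in_millions, scaled)


demo = pd.Series(["652000000", "1,131.6", None], name="foreign_gross")
parsed = parse_foreign_gross(demo)
print(parsed)
```

Unlike the manual replacement, this keeps the fractional millions, so `'1,131.6'` becomes 1,131,600,000 rather than 1,131,000,000.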

In [184]:
#check to see if the values changed
bom.sort_values(by=['domestic_gross'],ascending=False)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.0,1131000000,2015
3080,Black Panther,BV,700100000.0,646900000,2018
3079,Avengers: Infinity War,BV,678800000.0,1369000000,2018
1873,Jurassic World,Uni.,652300000.0,1019000000,2015
727,Marvel's The Avengers,BV,623400000.0,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [185]:
bom.dropna(inplace=True)
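
Dropping every row with a missing value is costly here: `bom.info()` shows 3387 rows before the call and 2007 after it, mostly because `foreign_gross` has only 2037 non-null entries. A possible alternative, sketched below, keeps films that report a domestic gross and flags those without a foreign figure. Treating a missing foreign gross as 0 is an assumption that understates the true total, which is why the flag column is kept:

```python
import pandas as pd


def add_total_gross(df):
    """Keep rows with a domestic gross; treat a missing foreign gross as 0 and flag it."""
    out = df.dropna(subset=["domestic_gross"]).copy()
    out["foreign_reported"] = out["foreign_gross"].notna()
    out["total_gross"] = out["domestic_gross"] + out["foreign_gross"].fillna(0)
    return out


# foreign_gross is assumed to be numeric already at this point.
demo = pd.DataFrame(
    {
        "title": ["Toy Story 3", "The Quake", "Solace"],
        "domestic_gross": [415000000.0, 6200.0, None],
        "foreign_gross": [652000000.0, None, 22400000.0],
    }
)
with_totals = add_total_gross(demo)
print(with_totals)
```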

In [186]:
bom

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3275,I Still See You,LGF,1400.0,1500000,2018
3286,The Catcher Was a Spy,IFC,725000.0,229000,2018
3309,Time Freak,Grindstone,10000.0,256000,2018
3342,Reign of Judges: Title of Liberty - Concept Short,Darin Southa,93200.0,5200,2018


In [187]:
#### convert the 'domestic_gross' column from float to int ####
bom['domestic_gross'] = bom['domestic_gross'].astype(int)

In [188]:
#### convert the 'foreign_gross' column from object (string) to int ####
bom['foreign_gross'] = bom['foreign_gross'].astype(int)

In [189]:
bom.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2007 entries, 0 to 3353
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           2007 non-null   object
 1   studio          2007 non-null   object
 2   domestic_gross  2007 non-null   int32 
 3   foreign_gross   2007 non-null   int32 
 4   year            2007 non-null   int64 
dtypes: int32(2), int64(1), object(2)
memory usage: 78.4+ KB


In [190]:
bom['total_gross'] = bom['domestic_gross'] + bom['foreign_gross']
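
In this run `astype(int)` produced `int32` columns, as the `bom.info()` output shows. A 32-bit signed integer tops out at 2,147,483,647, and the largest total in this dataset (2,067,700,000) is already close to that limit, so a sum of slightly larger values would overflow. A sketch of a safer conversion that requests `int64` explicitly (the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # largest value a 32-bit signed integer can hold


def gross_columns_to_int64(df, columns):
    """Return a copy of df with the given columns cast to 64-bit integers."""
    return df.astype({col: "int64" for col in columns})


# Star Wars: The Force Awakens, using the values shown in the notebook.
demo = pd.DataFrame({"domestic_gross": [936700000.0], "foreign_gross": ["1131000000"]})
demo = gross_columns_to_int64(demo, ["domestic_gross", "foreign_gross"])
demo["total_gross"] = demo["domestic_gross"] + demo["foreign_gross"]

# How far the largest total is from the int32 limit.
headroom = INT32_MAX - int(demo["total_gross"].iloc[0])
print(demo.dtypes)
print(headroom)
```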

In [191]:
bom

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,total_gross
0,Toy Story 3,BV,415000000,652000000,2010,1067000000
1,Alice in Wonderland (2010),BV,334200000,691300000,2010,1025500000
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000,664300000,2010,960300000
3,Inception,WB,292600000,535700000,2010,828300000
4,Shrek Forever After,P/DW,238700000,513900000,2010,752600000
...,...,...,...,...,...,...
3275,I Still See You,LGF,1400,1500000,2018,1501400
3286,The Catcher Was a Spy,IFC,725000,229000,2018,954000
3309,Time Freak,Grindstone,10000,256000,2018,266000
3342,Reign of Judges: Title of Liberty - Concept Short,Darin Southa,93200,5200,2018,98400


In [192]:
bom.sort_values(by=['total_gross'],ascending=False)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,total_gross
1872,Star Wars: The Force Awakens,BV,936700000,1131000000,2015,2067700000
3079,Avengers: Infinity War,BV,678800000,1369000000,2018,2047800000
1873,Jurassic World,Uni.,652300000,1019000000,2015,1671300000
727,Marvel's The Avengers,BV,623400000,895500000,2012,1518900000
1874,Furious 7,Uni.,353000000,1163000000,2015,1516000000
...,...,...,...,...,...,...
711,I'm Glad My Mother is Alive,Strand,8700,13200,2011,21900
322,The Thorn in the Heart,Osci.,7400,10500,2010,17900
1110,Cirkus Columbia,Strand,3500,9500,2012,13000
715,Aurora,CGld,5700,5100,2011,10800


In [193]:
bom = bom.drop(['studio', 'year'], axis=1)

In [194]:
bom.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2007 entries, 0 to 3353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           2007 non-null   object
 1   domestic_gross  2007 non-null   int32 
 2   foreign_gross   2007 non-null   int32 
 3   total_gross     2007 non-null   int32 
dtypes: int32(3), object(1)
memory usage: 54.9+ KB


In [195]:
bom

Unnamed: 0,title,domestic_gross,foreign_gross,total_gross
0,Toy Story 3,415000000,652000000,1067000000
1,Alice in Wonderland (2010),334200000,691300000,1025500000
2,Harry Potter and the Deathly Hallows Part 1,296000000,664300000,960300000
3,Inception,292600000,535700000,828300000
4,Shrek Forever After,238700000,513900000,752600000
...,...,...,...,...
3275,I Still See You,1400,1500000,1501400
3286,The Catcher Was a Spy,725000,229000,954000
3309,Time Freak,10000,256000,266000
3342,Reign of Judges: Title of Liberty - Concept Short,93200,5200,98400


In [201]:
bom.describe()

Unnamed: 0,domestic_gross,foreign_gross,total_gross
count,2007.0,2007.0,2007.0
mean,47019840.0,78626460.0,125646300.0
std,81626890.0,148080400.0,221199600.0
min,400.0,600.0,4900.0
25%,670000.0,4000000.0,8239000.0
50%,16700000.0,19700000.0,42400000.0
75%,56050000.0,77750000.0,133750000.0
max,936700000.0,1369000000.0,2067700000.0


In [202]:
bom.columns

Index(['title', 'domestic_gross', 'foreign_gross', 'total_gross'], dtype='object')

In [203]:
bom.isna().sum()

title             0
domestic_gross    0
foreign_gross     0
total_gross       0
dtype: int64

In [204]:
bom.dtypes

### foreign_gross is already numeric here (int32) after the conversion above

title             object
domestic_gross     int32
foreign_gross      int32
total_gross        int32
dtype: object

In [205]:
bom.head(20)

Unnamed: 0,title,domestic_gross,foreign_gross,total_gross
0,Toy Story 3,415000000,652000000,1067000000
1,Alice in Wonderland (2010),334200000,691300000,1025500000
2,Harry Potter and the Deathly Hallows Part 1,296000000,664300000,960300000
3,Inception,292600000,535700000,828300000
4,Shrek Forever After,238700000,513900000,752600000
5,The Twilight Saga: Eclipse,300500000,398000000,698500000
6,Iron Man 2,312400000,311500000,623900000
7,Tangled,200800000,391000000,591800000
8,Despicable Me,251500000,291600000,543100000
9,How to Train Your Dragon,217600000,277300000,494900000


In [207]:
bom.shape

(2007, 4)

## JOINING DATAFRAMES

In [208]:
bom.set_index('title', inplace=True)

In [209]:
bom.head()

Unnamed: 0_level_0,domestic_gross,foreign_gross,total_gross
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Toy Story 3,415000000,652000000,1067000000
Alice in Wonderland (2010),334200000,691300000,1025500000
Harry Potter and the Deathly Hallows Part 1,296000000,664300000,960300000
Inception,292600000,535700000,828300000
Shrek Forever After,238700000,513900000,752600000


In [210]:
imdb.set_index('primary_title', inplace=True)

In [211]:
imdb

Unnamed: 0_level_0,start_year,runtime_minutes,genres,averagerating,numvotes
primary_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066
The Dark Knight Rises,2012,164.0,"Action,Thriller",8.4,1387769
Interstellar,2014,169.0,"Adventure,Drama,Sci-Fi",8.6,1299334
Django Unchained,2012,165.0,"Drama,Western",8.4,1211405
The Avengers,2012,143.0,"Action,Adventure,Sci-Fi",8.1,1183655
...,...,...,...,...,...
Columbus,2018,85.0,Comedy,5.8,5
BADMEN with a good behavior,2018,87.0,"Comedy,Horror",9.2,5
July Kaatril,2019,,Romance,9.0,5
Swarm Season,2019,86.0,Documentary,6.2,5


In [213]:
joined_df = imdb.join(bom, how='inner')

joined_df

Unnamed: 0,start_year,runtime_minutes,genres,averagerating,numvotes,domestic_gross,foreign_gross,total_gross
'71,2014,99.0,"Action,Drama,Thriller",7.2,46103,1300000,355000,1655000
10 Cloverfield Lane,2016,103.0,"Drama,Horror,Mystery",7.2,260383,72100000,38100000,110200000
102 Not Out,2018,102.0,"Comedy,Drama",7.5,4802,1300000,10900000,12200000
11-11-11,2011,90.0,"Horror,Mystery,Thriller",4.0,11712,32800,5700000,5732800
12 Strong,2018,130.0,"Action,Drama,History",6.6,50155,45800000,21600000,67400000
...,...,...,...,...,...,...,...,...
Yves Saint Laurent,2014,106.0,"Biography,Drama",6.2,10311,724000,20300000,21024000
Zero Dark Thirty,2012,157.0,"Drama,Thriller",7.4,251072,95700000,37100000,132800000
Zookeeper,2011,102.0,"Comedy,Family,Romance",5.2,52396,80400000,89500000,169900000
Zoolander 2,2016,101.0,Comedy,4.7,59914,28800000,27900000,56700000


In [214]:
joined_df.sort_values(by=['numvotes'], ascending=False).head(60)

Unnamed: 0,start_year,runtime_minutes,genres,averagerating,numvotes,domestic_gross,foreign_gross,total_gross
Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066,292600000,535700000,828300000
The Dark Knight Rises,2012,164.0,"Action,Thriller",8.4,1387769,448100000,636800000,1084900000
Interstellar,2014,169.0,"Adventure,Drama,Sci-Fi",8.6,1299334,188000000,489400000,677400000
Django Unchained,2012,165.0,"Drama,Western",8.4,1211405,162800000,262600000,425400000
The Wolf of Wall Street,2013,180.0,"Biography,Crime,Drama",8.2,1035358,116900000,275100000,392000000
Shutter Island,2010,138.0,"Mystery,Thriller",8.1,1005960,128000000,166800000,294800000
Guardians of the Galaxy,2014,121.0,"Action,Adventure,Comedy",8.1,948394,333200000,440200000,773400000
Deadpool,2016,108.0,"Action,Adventure,Comedy",8.0,820847,363100000,420000000,783100000
The Hunger Games,2012,142.0,"Action,Adventure,Sci-Fi",7.2,795227,408000000,286400000,694400000
Mad Max: Fury Road,2015,120.0,"Action,Adventure,Sci-Fi",8.1,780910,153600000,224800000,378400000


In [215]:
joined_df.sort_values(by=['total_gross'], ascending=False).head(60)

Unnamed: 0,start_year,runtime_minutes,genres,averagerating,numvotes,domestic_gross,foreign_gross,total_gross
Avengers: Infinity War,2018,149.0,"Action,Adventure,Sci-Fi",8.5,670926,678800000,1369000000,2047800000
Jurassic World,2015,124.0,"Action,Adventure,Sci-Fi",7.0,539338,652300000,1019000000,1671300000
Furious 7,2015,137.0,"Action,Crime,Thriller",7.2,335074,353000000,1163000000,1516000000
Avengers: Age of Ultron,2015,141.0,"Action,Adventure,Sci-Fi",7.3,665594,459000000,946400000,1405400000
Black Panther,2018,134.0,"Action,Adventure,Sci-Fi",7.3,516148,700100000,646900000,1347000000
Star Wars: The Last Jedi,2017,152.0,"Action,Adventure,Fantasy",7.1,462903,620200000,712400000,1332600000
Jurassic World: Fallen Kingdom,2018,128.0,"Action,Adventure,Sci-Fi",6.2,219125,417700000,891800000,1309500000
Frozen,2010,93.0,"Adventure,Drama,Sport",6.2,62311,400700000,875700000,1276400000
Frozen,2010,92.0,"Fantasy,Romance",5.4,75,400700000,875700000,1276400000
Frozen,2013,102.0,"Adventure,Animation,Comedy",7.5,516998,400700000,875700000,1276400000
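
The three `Frozen` rows above show a side effect of joining on title alone: three different IMDb films named "Frozen" all receive the same Box Office Mojo gross. One way to reduce such false matches is to join on title and year together. This is a sketch under two assumptions: the `year` column is kept in `bom` (the notebook drops it earlier), and both frames have the title as a regular column rather than as the index.

```python
import pandas as pd


def join_on_title_and_year(imdb_df, bom_df):
    """Inner-join IMDb rows to box office rows on matching title and release year."""
    joined = imdb_df.merge(
        bom_df,
        left_on=["primary_title", "start_year"],
        right_on=["title", "year"],
        how="inner",
        validate="many_to_one",  # raises if a (title, year) pair repeats in bom_df
    )
    return joined.drop(columns=["title", "year"])


imdb_demo = pd.DataFrame(
    {
        "primary_title": ["Frozen", "Frozen", "Frozen"],
        "start_year": [2010, 2010, 2013],
        "averagerating": [6.2, 5.4, 7.5],
        "numvotes": [62311, 75, 516998],
    }
)
bom_demo = pd.DataFrame(
    {
        "title": ["Frozen"],
        "year": [2013],  # year assumed for illustration; it is not shown in the output above
        "total_gross": [1276400000],
    }
)
joined_demo = join_on_title_and_year(imdb_demo, bom_demo)
print(joined_demo)
```

Two films sharing both title and year would still collide, and titles that differ slightly between the sources (for example "The Avengers" versus "Marvel's The Avengers") would still be missed, so this narrows the problem without fully solving it.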


In [None]:
#### MUST USE WHERE numvotes > 25000

# 3 THE NUMBERS

In [218]:
##Loading up the third dataframe, THE NUMBERS, with Pandas.
numbers = pd.read_csv("zippedData/tn.movie_budgets.csv.gz")
numbers

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0


In [219]:
numbers = numbers.drop(['id', 'domestic_gross'], axis=1)

In [222]:
numbers.set_index('movie', inplace=True)

In [234]:
numbers.head()

Unnamed: 0_level_0,release_date,production_budget,worldwide_gross
movie,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avatar,"Dec 18, 2009","$425,000,000","$2,776,345,279"
Pirates of the Caribbean: On Stranger Tides,"May 20, 2011","$410,600,000","$1,045,663,875"
Dark Phoenix,"Jun 7, 2019","$350,000,000","$149,762,350"
Avengers: Age of Ultron,"May 1, 2015","$330,600,000","$1,403,013,963"
Star Wars Ep. VIII: The Last Jedi,"Dec 15, 2017","$317,000,000","$1,316,721,747"


In [235]:
numbers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5782 entries, Avatar to My Date With Drew
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       5782 non-null   object
 1   production_budget  5782 non-null   object
 2   worldwide_gross    5782 non-null   object
dtypes: object(3)
memory usage: 180.7+ KB


In [239]:
numbers['production_budget'] = numbers['production_budget'].astype(float)

ValueError: could not convert string to float: '$425,000,000'

In [236]:
numbers['roi'] = numbers['worldwide_gross'] - numbers['production_budget']

TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [231]:
numbers

ValueError: invalid literal for int() with base 10: '$2,776,345,279'
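
The three errors above share one cause: the money columns are strings such as `'$425,000,000'`. A minimal sketch of a fix strips the `$` and `,` characters before converting. `int64` is requested explicitly because Avatar's worldwide gross (2,776,345,279) does not fit in 32 bits. Note also that gross minus budget is a profit figure; return on investment is usually expressed as profit divided by budget, so the sketch computes both under separate names.

```python
import pandas as pd


def parse_dollars(series):
    """Convert strings such as '$425,000,000' to 64-bit integers."""
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")


# Two rows copied from the table above.
demo = pd.DataFrame(
    {
        "movie": ["Avatar", "Dark Phoenix"],
        "production_budget": ["$425,000,000", "$350,000,000"],
        "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
    }
)
for col in ["production_budget", "worldwide_gross"]:
    demo[col] = parse_dollars(demo[col])

demo["profit"] = demo["worldwide_gross"] - demo["production_budget"]
demo["roi"] = demo["profit"] / demo["production_budget"]
print(demo)
```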

In [56]:
numbers.shape

(5782, 6)

In [None]:
##### RELEASE DATES / WORLDWIDEGROSS

In [57]:
numbers.index

RangeIndex(start=0, stop=5782, step=1)

In [58]:
numbers.columns

Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')

In [59]:
numbers.isna().sum()

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

In [60]:
numbers.dtypes

id                    int64
release_date         object
movie                object
production_budget    object
domestic_gross       object
worldwide_gross      object
dtype: object

In [61]:
numbers.value_counts()

id   release_date  movie                     production_budget  domestic_gross  worldwide_gross
100  Sep 2, 2005   The Transporter 2         $32,000,000        $43,095,856     $88,978,458        1
34   Apr 30, 2010  Housefull                 $10,100,000        $1,183,658      $18,726,300        1
     Apr 5, 2019   The Best of Enemies       $10,000,000        $10,205,616     $10,205,616        1
     Aug 13, 2010  The Expendables           $82,000,000        $103,068,524    $268,268,174       1
     Aug 25, 2017  Birth of the Dragon       $31,000,000        $6,901,965      $7,220,490         1
                                                                                                  ..
67   Jun 15, 2005  Batman Begins             $150,000,000       $205,343,774    $359,142,722       1
     Jun 19, 1987  The Brave Little Toaster  $2,300,000         $0              $0                 1
     Jun 3, 1988   Big                       $18,000,000        $114,968,774    $151,668,774    

In [None]:
#################### VISUAL EXAMPLES################################
#import matplotlib.pyplot as plt

# Set up figure and axes
#fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
#fig.set_tight_layout(True)

# Histogram of Wins and Frequencies
#ax1.hist(x=wins, bins=range(8), align="left", color="green")
#ax1.set_xticks(range(7))
#ax1.set_xlabel("Wins in 2018 World Cup")
#ax1.set_ylabel("Frequency")
#ax1.set_title("Distribution of Wins")

## Horizontal Bar Graph of Wins by Country
#ax2.barh(teams[::-1], wins[::-1], color="green")
#ax2.set_xlabel("Wins in 2018 World Cup")
#ax2.set_title("Wins by Country");

##################################################################
# Set up figure
#fig, ax = plt.subplots(figsize=(8, 5))

# Basic scatter plot
#ax.scatter(
#    x=populations,
#    y=wins,
#    color="gray", alpha=0.5, s=100
#)
#ax.set_xlabel("2018 Population")
#ax.set_ylabel("2018 World Cup Wins")
#ax.set_title("Population vs. World Cup Wins")

# Add annotations for specific points of interest
#highlighted_points = {
#    "Belgium": 2, # Numbers are the index of that
#    "Brazil": 3,  # country in populations & wins
#    "France": 10,
#    "Nigeria": 17
#}
#for country, index in highlighted_points.items():
    # Get x and y position of data point
#    x = populations[index]
#    y = wins[index]
    # Move each point slightly down and to the left
    # (numbers were chosen by manually tweaking)
#    xtext = x - (1.25e6 * len(country))
#    ytext = y - 0.5
    # Annotate with relevant arguments
#    ax.annotate(
#        text=country,
#        xy=(x, y),
#        xytext=(xtext, ytext)
#    )
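
The commented template above plots World Cup data. Adapted to this project, a horizontal bar chart of the top-grossing films might look like the sketch below. It assumes a frame shaped like `joined_df` (titles as the index, a `total_gross` column); the function name and styling choices are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_grossing(df, n=5):
    """Draw a horizontal bar chart of the n films with the highest total_gross."""
    top = df.nlargest(n, "total_gross")
    fig, ax = plt.subplots(figsize=(8, 5))
    # Reverse the order so the highest-grossing film is drawn at the top.
    ax.barh(top.index[::-1], top["total_gross"].iloc[::-1] / 1e9, color="green")
    ax.set_xlabel("Total gross (billions of USD)")
    ax.set_title(f"Top {n} films by total gross")
    fig.tight_layout()
    return fig, ax


# Three rows copied from the joined_df output.
demo = pd.DataFrame(
    {"total_gross": [2047800000, 1671300000, 1516000000]},
    index=["Avengers: Infinity War", "Jurassic World", "Furious 7"],
)
fig, ax = plot_top_grossing(demo, n=3)
```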

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

Questions to consider:

* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?

In [None]:
###usa_2016_gold_medals = []

##for row in olympics_data:
##    if row["Medal"] == "G" and row["Nationality"] == "USA" and row["Year"] == "2016":
##        usa_2016_gold_medals.append({"Event": row["Event"], "Name": row["Name"]})
        
## usa_2016_gold_medals
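
Since the business question is which types of films perform best, one candidate analysis is to compare gross by genre. Because `genres` holds comma-separated strings, each film first has to be expanded into one row per genre. The sketch below assumes a frame with `genres` and `total_gross` columns, like `joined_df`; using the median rather than the mean is a choice made here to limit the pull of a few blockbusters.

```python
import pandas as pd


def gross_by_genre(df):
    """Return film count and median total_gross per individual genre."""
    exploded = df.assign(genre=df["genres"].str.split(",")).explode("genre")
    return (
        exploded.groupby("genre")["total_gross"]
        .agg(films="count", median_gross="median")
        .sort_values("median_gross", ascending=False)
    )


# Three rows copied from the joined_df output.
demo = pd.DataFrame(
    {
        "primary_title": ["Inception", "Django Unchained", "Interstellar"],
        "genres": ["Action,Adventure,Sci-Fi", "Drama,Western", "Adventure,Drama,Sci-Fi"],
        "total_gross": [828300000, 425400000, 677400000],
    }
)
genre_summary = gross_by_genre(demo)
print(genre_summary)
```

A film with several genres counts once toward each of them, so the per-genre film counts add up to more than the number of films.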

## Evaluation

Evaluate how well your work solves the stated business problem.

Questions to consider:

* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.


Questions to consider:

* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?