![example](images/director_shot.jpeg)

# Phase 1 Project

**Authors:** Jonathan Holt
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Here you run your code to explore the data
import glob, os
fpath = 'zippedData/'
os.listdir(fpath)

['imdb.title.crew.csv.gz',
 'tmdb.movies.csv.gz',
 'imdb.title.akas.csv.gz',
 'imdb.title.ratings.csv.gz',
 'imdb.name.basics.csv.gz',
 'rt.reviews.tsv.gz',
 'imdb.title.basics.csv.gz',
 'rt.movie_info.tsv.gz',
 'tn.movie_budgets.csv.gz',
 'bom.movie_gross.csv.gz',
 'imdb.title.principals.csv.gz']

In [3]:
query = fpath+"*.gz"

file_list=glob.glob(query)
file_list

['zippedData/imdb.title.crew.csv.gz',
 'zippedData/tmdb.movies.csv.gz',
 'zippedData/imdb.title.akas.csv.gz',
 'zippedData/imdb.title.ratings.csv.gz',
 'zippedData/imdb.name.basics.csv.gz',
 'zippedData/rt.reviews.tsv.gz',
 'zippedData/imdb.title.basics.csv.gz',
 'zippedData/rt.movie_info.tsv.gz',
 'zippedData/tn.movie_budgets.csv.gz',
 'zippedData/bom.movie_gross.csv.gz',
 'zippedData/imdb.title.principals.csv.gz']

In [4]:
tables = {}

for file in file_list:
    print('---'*20)
    file_name = file.replace('zippedData/', '').replace('.', '_')
    print(file_name)
    
    
    
    if 'tsv.gz' in file:
        temp_df = pd.read_csv(file, sep= "\t", encoding = "latin-1")
    else:
        temp_df = pd.read_csv(file)
    
    display(temp_df.head(), temp_df.tail())
    tables[file_name] = temp_df 

------------------------------------------------------------
imdb_title_crew_csv_gz


Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


Unnamed: 0,tconst,directors,writers
146139,tt8999974,nm10122357,nm10122357
146140,tt9001390,nm6711477,nm6711477
146141,tt9001494,"nm10123242,nm10123248",
146142,tt9004986,nm4993825,nm4993825
146143,tt9010172,,nm8352242


------------------------------------------------------------
tmdb_movies_csv_gz


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


------------------------------------------------------------
imdb_title_akas_csv_gz


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
331698,tt9827784,2,Sayonara kuchibiru,,,original,,1.0
331699,tt9827784,3,Farewell Song,XWW,en,imdbDisplay,,0.0
331700,tt9880178,1,La atención,,,original,,1.0
331701,tt9880178,2,La atención,ES,,,,0.0
331702,tt9880178,3,The Attention,XWW,en,imdbDisplay,,0.0


------------------------------------------------------------
imdb_title_ratings_csv_gz


Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


Unnamed: 0,tconst,averagerating,numvotes
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5
73855,tt9894098,6.3,128


------------------------------------------------------------
imdb_name_basics_csv_gz


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
606643,nm9990381,Susan Grobes,,,actress,
606644,nm9990690,Joo Yeon So,,,actress,"tt9090932,tt8737130"
606645,nm9991320,Madeline Smith,,,actress,"tt8734436,tt9615610"
606646,nm9991786,Michelle Modigliani,,,producer,
606647,nm9993380,Pegasus Envoyé,,,"director,actor,writer",tt8743182


------------------------------------------------------------
rt_reviews_tsv_gz


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


------------------------------------------------------------
imdb_title_basics_csv_gz


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,
146143,tt9916754,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,,Documentary


------------------------------------------------------------
rt_movie_info_tsv_gz


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


------------------------------------------------------------
tn_movie_budgets_csv_gz


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


------------------------------------------------------------
bom_movie_gross_csv_gz


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


------------------------------------------------------------
imdb_title_principals_csv_gz


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


Unnamed: 0,tconst,ordering,nconst,category,job,characters
1028181,tt9692684,1,nm0186469,actor,,"[""Ebenezer Scrooge""]"
1028182,tt9692684,2,nm4929530,self,,"[""Herself"",""Regan""]"
1028183,tt9692684,3,nm10441594,director,,
1028184,tt9692684,4,nm6009913,writer,writer,
1028185,tt9692684,5,nm10441595,producer,producer,


In [5]:
#Formatting cells. I am keeping all of my display commands here so I can easily find them if/when I need
#to change anything.

#change the number of rows displayed
pd.set_option('display.max_rows', 1000)

In [6]:
#This rounds floats to whole numbers (with thousands separators) for display only; the stored values are
#unchanged. I may need to change this for some of the other features.
pd.options.display.float_format = '{:,.0f}'.format
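
A quick illustration of what this display setting does. The sample values are made up for the example, and `pd.option_context` is used here only so the setting is scoped to one block instead of the whole notebook.

```python
import pandas as pd

# The same formatter that is assigned to pd.options.display.float_format above.
fmt = '{:,.0f}'.format

values = pd.Series([5.53, 1234567.89])  # made-up sample values

# option_context applies the display option only inside the `with` block.
with pd.option_context('display.float_format', fmt):
    shown = repr(values)

print(shown)             # floats are printed as rounded whole numbers
print(values.tolist())   # the underlying data still has its decimals
```

Small ratios are the values most affected: with this setting, anything between 0.5 and 1.5 is displayed as 1.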

In [7]:
#attempting to sort by the largest worldwide_gross

tables['tn_movie_budgets_csv_gz'].sort_values(by='worldwide_gross', ascending=False)

#Something is wrong here: values starting with '9' are sorted first. This suggests that the dollar
#amounts are strings, which are compared character by character, instead of numbers. I should also
#check for null values while I'm checking for this.

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
3737,38,"Aug 21, 2009",Fifty Dead Men Walking,"$10,000,000",$0,"$997,921"
3432,33,"Sep 30, 2005",Duma,"$12,000,000","$870,067","$994,790"
5062,63,"Apr 1, 2011",Insidious,"$1,500,000","$54,009,150","$99,870,886"
883,84,"Apr 2, 2004",Hellboy,"$60,000,000","$59,623,958","$99,823,958"
5613,14,"Mar 21, 1980",Mad Max,"$200,000","$8,750,000","$99,750,000"
...,...,...,...,...,...,...
5488,89,"Dec 31, 2014",The Sound and the Shadow,"$500,000",$0,$0
5487,88,"Dec 1, 2015",Brooklyn Bizarre,"$500,000",$0,$0
5486,87,"Aug 11, 2015",Alleluia! The Devil's Carnival,"$500,000",$0,$0
5485,86,"Jun 23, 2015",Crossroads,"$500,000",$0,$0
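
To see why the sort above looks wrong, here is a minimal sketch using four of the `worldwide_gross` values from the table. Strings are compared one character at a time, so `'$99...'` sorts above `'$2,7...'` no matter how many digits follow.

```python
import pandas as pd

gross = pd.Series(["$2,776,345,279", "$997,921", "$99,870,886", "$0"])

# Sorting the raw strings: character-by-character comparison.
as_text = gross.sort_values(ascending=False).tolist()

# Sorting after stripping '$' and ',' and converting to float.
as_number = (
    gross.str.replace(r"[$,]", "", regex=True)
    .astype(float)
    .sort_values(ascending=False)
    .tolist()
)

print(as_text)    # '$997,921' comes first
print(as_number)  # 2776345279.0 comes first
```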


In [8]:
#set alias for each table so that it is easier to use them.
table1 = tables['imdb_title_crew_csv_gz']
table2 = tables['tmdb_movies_csv_gz']
table3 = tables['imdb_title_akas_csv_gz']
table4 = tables['imdb_title_ratings_csv_gz']
table5 = tables['imdb_name_basics_csv_gz']
table6 = tables['rt_reviews_tsv_gz']
table7 = tables['imdb_title_basics_csv_gz']
table8 = tables['rt_movie_info_tsv_gz']
table9 = tables['tn_movie_budgets_csv_gz']
table10 = tables['bom_movie_gross_csv_gz']
table11 = tables['imdb_title_principals_csv_gz']

In [9]:
table9.info()

#Sure enough, everything is a string except for the ID field. I will convert all of the financial information
#to integers (actually, I went back and changed to floats), and while I'm at it, I will also convert the
#release date field to date/time.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [10]:
table9.isna().sum()
#there are no null values, but there are likely placeholders

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

In [11]:
#cleaning the three financial columns. Removing the $, removing the comma, then converting to float.
cleaned_budget = table9['production_budget'].map(lambda x: x.replace('$',''))
cleaned_budget

0        425,000,000
1        410,600,000
2        350,000,000
3        330,600,000
4        317,000,000
            ...     
5777           7,000
5778           6,000
5779           5,000
5780           1,400
5781           1,100
Name: production_budget, Length: 5782, dtype: object

In [12]:
cleaned_budget_2 = cleaned_budget.map(lambda x: x.replace(',',''))
cleaned_budget_2

0        425000000
1        410600000
2        350000000
3        330600000
4        317000000
           ...    
5777          7000
5778          6000
5779          5000
5780          1400
5781          1100
Name: production_budget, Length: 5782, dtype: object

In [13]:
cleaned_budget_3 = cleaned_budget_2.astype(float)
cleaned_budget_3

0      425,000,000
1      410,600,000
2      350,000,000
3      330,600,000
4      317,000,000
           ...    
5777         7,000
5778         6,000
5779         5,000
5780         1,400
5781         1,100
Name: production_budget, Length: 5782, dtype: float64

In [14]:
cleaned_domestic = table9['domestic_gross'].map(lambda x: x.replace('$',''))

cleaned_domestic


0        760,507,625
1        241,063,875
2         42,762,350
3        459,005,868
4        620,181,382
            ...     
5777               0
5778          48,482
5779           1,338
5780               0
5781         181,041
Name: domestic_gross, Length: 5782, dtype: object

In [15]:
cleaned_domestic_2 =cleaned_domestic.map(lambda x: x.replace(',',''))
cleaned_domestic_2

0        760507625
1        241063875
2         42762350
3        459005868
4        620181382
           ...    
5777             0
5778         48482
5779          1338
5780             0
5781        181041
Name: domestic_gross, Length: 5782, dtype: object

In [16]:
cleaned_domestic_3 = cleaned_domestic_2.astype(float)
cleaned_domestic_3

0      760,507,625
1      241,063,875
2       42,762,350
3      459,005,868
4      620,181,382
           ...    
5777             0
5778        48,482
5779         1,338
5780             0
5781       181,041
Name: domestic_gross, Length: 5782, dtype: float64

In [17]:
cleaned_worldwide = table9['worldwide_gross'].map(lambda x: x.replace('$',''))
cleaned_worldwide

0        2,776,345,279
1        1,045,663,875
2          149,762,350
3        1,403,013,963
4        1,316,721,747
             ...      
5777                 0
5778           240,495
5779             1,338
5780                 0
5781           181,041
Name: worldwide_gross, Length: 5782, dtype: object

In [18]:
cleaned_worldwide_2 = cleaned_worldwide.map(lambda x: x.replace(',',''))
cleaned_worldwide_2

0        2776345279
1        1045663875
2         149762350
3        1403013963
4        1316721747
           ...     
5777              0
5778         240495
5779           1338
5780              0
5781         181041
Name: worldwide_gross, Length: 5782, dtype: object

In [19]:
cleaned_worldwide_3 = cleaned_worldwide_2.astype(float)
cleaned_worldwide_3

0      2,776,345,279
1      1,045,663,875
2        149,762,350
3      1,403,013,963
4      1,316,721,747
            ...     
5777               0
5778         240,495
5779           1,338
5780               0
5781         181,041
Name: worldwide_gross, Length: 5782, dtype: float64

In [20]:
#Putting my cleaned data into the table.
table9["production_budget"] = cleaned_budget_3
table9["domestic_gross"] = cleaned_domestic_3
table9["worldwide_gross"] = cleaned_worldwide_3
table9


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,7000,0,0
5778,79,"Apr 2, 1999",Following,6000,48482,240495
5779,80,"Jul 13, 2005",Return to the Land of Wonders,5000,1338,1338
5780,81,"Sep 29, 2015",A Plague So Pleasant,1400,0,0
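
The nine cleaning cells above repeat the same three steps for each column. A more compact version might look like the sketch below. `to_dollars` is a hypothetical helper name, and the sample frame uses two rows copied from the table so the block runs on its own.

```python
import pandas as pd


def to_dollars(col):
    """Strip '$' and ',' from a string column and convert it to float (hypothetical helper)."""
    return col.str.replace(r"[$,]", "", regex=True).astype(float)


# Two rows copied from the budgets table, still in their raw string form.
sample = pd.DataFrame({
    "production_budget": ["$425,000,000", "$1,100"],
    "domestic_gross": ["$760,507,625", "$181,041"],
    "worldwide_gross": ["$2,776,345,279", "$181,041"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    sample[col] = to_dollars(sample[col])

print(sample.dtypes)
```

The same loop could be pointed at `table9` before the conversion, provided the three columns are still strings at that point.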


In [21]:
table9.sort_values(by='worldwide_gross', ascending=False)
#After all that work, I finally have the data sorted the way that I wanted.

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
42,43,"Dec 19, 1997",Titanic,200000000,659363944,2208208395
5,6,"Dec 18, 2015",Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220
6,7,"Apr 27, 2018",Avengers: Infinity War,300000000,678815482,2048134200
33,34,"Jun 12, 2015",Jurassic World,215000000,652270625,1648854864
...,...,...,...,...,...,...
5474,75,"Dec 31, 2005",Insomnia Manica,500000,0,0
5473,74,"Jul 17, 2012",Girls Gone Dead,500000,0,0
5472,73,"Apr 3, 2012",Enter Nowhere,500000,0,0
5471,72,"Dec 31, 2010",Drones,500000,0,0


In [22]:
#converting release_date to datetime format.
cleaned_release_date = pd.to_datetime(table9['release_date'])
cleaned_release_date

0      2009-12-18
1      2011-05-20
2      2019-06-07
3      2015-05-01
4      2017-12-15
          ...    
5777   2018-12-31
5778   1999-04-02
5779   2005-07-13
5780   2015-09-29
5781   2005-08-05
Name: release_date, Length: 5782, dtype: datetime64[ns]

In [23]:
table9['release_date']= cleaned_release_date
table9

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,425000000,760507625,2776345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
...,...,...,...,...,...,...
5777,78,2018-12-31,Red 11,7000,0,0
5778,79,1999-04-02,Following,6000,48482,240495
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338
5780,81,2015-09-29,A Plague So Pleasant,1400,0,0
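
`pd.to_datetime` inferred the date layout on its own here. If stricter parsing is wanted, the format can be spelled out, as in this sketch. The format string `"%b %d, %Y"` is my reading of values such as `Dec 18, 2009` and is an assumption about the whole column.

```python
import pandas as pd

# Three release dates copied from the budgets table.
dates = pd.Series(["Dec 18, 2009", "Jun 7, 2019", "Apr 2, 1999"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%b %d, %Y")
years = parsed.dt.year

print(parsed)
print(years.tolist())
```

With an explicit format, a value that does not match raises an error instead of being guessed at, which can surface malformed rows early.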


In [24]:
#what is the range of release dates in this data set?
print("This is the earliest release date:")
print(table9['release_date'].min())

print("This is the latest release date:")
print(table9['release_date'].max())

This is the earliest release date:
1915-02-08 00:00:00
This is the latest release date:
2020-12-31 00:00:00


In [25]:
#there are several movies that have budget information, but no gross. Presumably, this is because they hadn't
#been released at the time that this data was collected. 
table9.sort_values(by='release_date', ascending=False)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
194,95,2020-12-31,Moonfall,150000000,0,0
1205,6,2020-12-31,Hannibal the Conqueror,50000000,0,0
535,36,2020-02-21,Call of the Wild,82000000,0,0
480,81,2019-12-31,Army of the Dead,90000000,0,0
3515,16,2019-12-31,Eli,11000000,0,0
...,...,...,...,...,...,...
5606,7,1925-11-19,The Big Parade,245000,11000000,22000000
5683,84,1920-09-17,Over the Hill to the Poorhouse,100000,3000000,3000000
5614,15,1916-12-24,"20,000 Leagues Under the Sea",200000,8000000,8000000
5523,24,1916-09-05,Intolerance,385907,0,0
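
The zeros in the gross columns behave like placeholders, not real box-office results (note that the 1916 film Intolerance also shows 0). One hedged way to keep them out of averages is to mask them as missing. The sketch below assumes that a gross of exactly 0 always means "unknown", and the column name `worldwide_gross_known` is made up for the example.

```python
import pandas as pd

# Three rows copied from the budgets table.
films = pd.DataFrame({
    "movie": ["Moonfall", "Captain Marvel", "Intolerance"],
    "release_year": [2020, 2019, 1916],
    "worldwide_gross": [0.0, 1123061550.0, 0.0],
})

# Assumption: a gross of exactly 0 is a placeholder, so replace it with NaN.
gross = films["worldwide_gross"]
films["worldwide_gross_known"] = gross.mask(gross == 0)

missing = int(films["worldwide_gross_known"].isna().sum())
mean_known = films["worldwide_gross_known"].mean()  # NaN values are skipped

print(missing, mean_known)
```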


In [26]:
year_of_release = table9['release_date']
year_of_release

0      2009-12-18
1      2011-05-20
2      2019-06-07
3      2015-05-01
4      2017-12-15
          ...    
5777   2018-12-31
5778   1999-04-02
5779   2005-07-13
5780   2015-09-29
5781   2005-08-05
Name: release_date, Length: 5782, dtype: datetime64[ns]

In [27]:
year_of_release = pd.DatetimeIndex(table9['release_date']).year
year_of_release

Int64Index([2009, 2011, 2019, 2015, 2017, 2015, 2018, 2007, 2017, 2015,
            ...
            2012, 1993, 2004, 2006, 2004, 2018, 1999, 2005, 2015, 2005],
           dtype='int64', name='release_date', length=5782)

In [28]:
#created a column 'release_year' to more easily search general release dates, etc.
table9['release_year'] = table9['release_date'].dt.year

In [29]:
table9

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,2019
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,2015
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,2017
...,...,...,...,...,...,...,...
5777,78,2018-12-31,Red 11,7000,0,0,2018
5778,79,1999-04-02,Following,6000,48482,240495,1999
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,2005
5780,81,2015-09-29,A Plague So Pleasant,1400,0,0,2015


In [31]:
films_released_by_year = table9['release_year']
films_released_by_year.value_counts()

2015    338
2010    274
2008    264
2006    260
2014    255
2011    254
2009    239
2013    238
2012    235
2005    223
2007    220
2016    219
2002    210
2004    206
2003    201
2000    189
2001    181
1999    181
2017    168
1998    151
2018    143
1996    104
1997    102
1995     75
2019     67
1994     56
1993     49
1991     39
1992     36
1987     35
1986     33
1989     32
1988     31
1981     30
1982     30
1990     30
1980     29
1985     29
1984     28
1983     24
1979     18
1977     15
1978     13
1976     11
1970     11
1974     11
1956     10
1969     10
1971      9
1968      9
1963      8
1967      8
1975      8
1962      8
1973      8
1964      7
1972      7
1965      6
1951      6
1966      6
1953      6
1961      5
1946      5
1960      5
1959      4
1940      4
1945      3
1939      3
1933      3
1936      3
2020      3
1954      3
1952      3
1957      3
1948      2
1944      2
1916      2
1925      2
1942      2
1949      2
1955      2
1943      2
1950      2
1938

In [32]:
#The bulk of the data starts in 1996: every year from then through 2018 has 100+ movies.
#There is only a little data for 2019 (67 movies).
films_released_by_year.head(10)

0    2009
1    2011
2    2019
3    2015
4    2017
5    2015
6    2018
7    2007
8    2017
9    2015
Name: release_year, dtype: int64
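
The 100-film cutoff mentioned above can be checked directly without reading the long list by eye. This sketch uses a handful of the yearly counts copied from the `value_counts()` output. On the full data, the same two lines would run on `table9['release_year'].value_counts()`.

```python
import pandas as pd

# A few of the per-year counts shown above (year -> number of films).
counts = pd.Series({2018: 143, 1996: 104, 2019: 67, 1997: 102, 1995: 75, 1998: 151, 2017: 168})

# Put the years in chronological order, then keep those with at least 100 films.
by_year = counts.sort_index()
dense_years = by_year[by_year >= 100].index.tolist()

print(dense_years)
```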

In [34]:
#let's sort by release year
table9.sort_values(by='release_year', ascending=False).head(50)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year
194,95,2020-12-31,Moonfall,150000000,0,0,2020
535,36,2020-02-21,Call of the Wild,82000000,0,0,2020
1205,6,2020-12-31,Hannibal the Conqueror,50000000,0,0,2020
2029,30,2019-09-30,Unhinged,29000000,0,0,2019
670,71,2019-08-30,PLAYMOBIL,75000000,0,0,2019
95,96,2019-03-08,Captain Marvel,175000000,426525952,1123061550,2019
1474,75,2019-05-03,Long Shot,40000000,30202860,43711031,2019
4132,33,2019-03-29,Unplanned,6000000,18107621,18107621,2019
4534,35,2019-06-07,Late Night,4000000,246305,246305,2019
4135,36,2019-02-08,The Prodigy,6000000,14856291,19789712,2019


In [36]:
table9.sort_values(by='worldwide_gross', ascending=False).head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009
42,43,1997-12-19,Titanic,200000000,659363944,2208208395,1997
5,6,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,2015
6,7,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,2018
33,34,2015-06-12,Jurassic World,215000000,652270625,1648854864,2015


In [37]:
#Create a column that calculates the profit for each movie (worldwide gross - production budget).
#Note that this overstates true profit: there are other costs above and beyond the production budget.

total_profit = table9['worldwide_gross'] - table9['production_budget']
total_profit

0      2,351,345,279
1        635,063,875
2       -200,237,650
3      1,072,413,963
4        999,721,747
            ...     
5777          -7,000
5778         234,495
5779          -3,662
5780          -1,400
5781         179,941
Length: 5782, dtype: float64

In [38]:
table9['total_profit'] = total_profit

In [39]:
table9.sort_values(by='total_profit', ascending=False).head(10)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009,2351345279
42,43,1997-12-19,Titanic,200000000,659363944,2208208395,1997,2008208395
6,7,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,2018,1748134200
5,6,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,2015,1747311220
33,34,2015-06-12,Jurassic World,215000000,652270625,1648854864,2015,1433854864
66,67,2015-04-03,Furious 7,190000000,353007020,1518722794,2015,1328722794
26,27,2012-05-04,The Avengers,225000000,623279547,1517935897,2012,1292935897
260,61,2011-07-15,Harry Potter and the Deathly Hallows: Part II,125000000,381193157,1341693157,2011,1216693157
41,42,2018-02-16,Black Panther,200000000,700059566,1348258224,2018,1148258224
112,13,2018-06-22,Jurassic World: Fallen Kingdom,170000000,417719760,1305772799,2018,1135772799


In [40]:
#What is the return on investment (ROI = total_profit / production_budget) for these movies?
roi = table9['total_profit'] / table9['production_budget']
roi

0        6
1        2
2       -1
3        3
4        3
        ..
5777    -1
5778    39
5779    -1
5780    -1
5781   164
Length: 5782, dtype: float64

In [41]:
table9['ROI'] = roi
table9

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009,2351345279,6
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011,635063875,2
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,2019,-200237650,-1
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,2015,1072413963,3
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,2017,999721747,3
...,...,...,...,...,...,...,...,...,...
5777,78,2018-12-31,Red 11,7000,0,0,2018,-7000,-1
5778,79,1999-04-02,Following,6000,48482,240495,1999,234495,39
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,2005,-3662,-1
5780,81,2015-09-29,A Plague So Pleasant,1400,0,0,2015,-1400,-1
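
Because the float display format rounds to whole numbers, the ROI column above hides its decimals. The sketch below recomputes profit and ROI for three films from the table and shows the unrounded values. The formula is the one used above: ROI = (worldwide gross - production budget) / production budget.

```python
import pandas as pd

# Budget and worldwide gross copied from the budgets table.
films = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix", "My Date With Drew"],
    "production_budget": [425000000.0, 350000000.0, 1100.0],
    "worldwide_gross": [2776345279.0, 149762350.0, 181041.0],
})

# Column arithmetic works row by row without needing .apply().
films["total_profit"] = films["worldwide_gross"] - films["production_budget"]
films["ROI"] = films["total_profit"] / films["production_budget"]

print(films[["movie", "ROI"]].round(2))
```

An ROI displayed as -1 can therefore mean anything from a loss of a bit over half the budget (Dark Phoenix, about -0.57) to a total loss.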


In [42]:
#A quick snapshot of the information. 
table9.describe()

Unnamed: 0,id,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI
count,5782,5782,5782,5782,5782,5782,5782
mean,50,31587757,41873327,91487461,2004,59899704,4
std,29,41812077,68240597,174719969,13,146088881,30
min,1,1100,0,0,1915,-200237650,-1
25%,25,5000000,1429534,4125415,2000,-2189071,-1
50%,50,17000000,17225945,27984448,2007,8550286,1
75%,75,40000000,52348662,97645836,2012,60968502,3
max,100,425000000,936662225,2776345279,2020,2351345279,1799


In [43]:
#sorted by ROI.
table9.sort_values(by='ROI', ascending=False).head(100)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI
5745,46,1972-06-30,Deep Throat,25000,45000000,45000000,1972,44975000,1799
5613,14,1980-03-21,Mad Max,200000,8750000,99750000,1980,99550000,498
5492,93,2009-09-25,Paranormal Activity,450000,107918810,194183034,2009,193733034,431
5679,80,2015-07-10,The Gallows,100000,22764410,41656474,2015,41556474,416
5406,7,1999-07-14,The Blair Witch Project,600000,140539099,248300000,1999,247700000,413
5709,10,2004-05-07,Super Size Me,65000,11529368,22233808,2004,22168808,341
5346,47,1942-08-13,Bambi,858000,102797000,268000000,1942,267142000,311
5773,74,1993-02-26,El Mariachi,7000,2040920,2041928,1993,2034928,291
5676,77,1968-10-01,Night of the Living Dead,114000,12087064,30087064,1968,29973064,263
5210,11,1976-11-21,Rocky,1000000,117235147,225000000,1976,224000000,224
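
The `total_profit` and `ROI` values shown above are consistent with profit being worldwide gross minus production budget, and ROI being that profit divided by the budget. The sketch below recomputes both on three rows copied from the tables above. The formulas are an assumption inferred from the displayed numbers, and the `budgets` name is only for illustration.

```python
import pandas as pd

# Three rows copied from the financial table above.
budgets = pd.DataFrame({
    "movie": ["Dark Phoenix", "Deep Throat", "Following"],
    "production_budget": [350_000_000, 25_000, 6_000],
    "worldwide_gross": [149_762_350, 45_000_000, 240_495],
})

# Assumed definitions: profit is worldwide gross minus budget,
# and ROI is that profit expressed as a multiple of the budget.
budgets["total_profit"] = budgets["worldwide_gross"] - budgets["production_budget"]
budgets["ROI"] = budgets["total_profit"] / budgets["production_budget"]

print(budgets[["movie", "total_profit", "ROI"]])
```

Because ROI divides by the budget, very small budgets (such as Deep Throat's 25,000) can produce extreme values that dominate a sort by ROI.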


In [None]:
#I am happy with where my financial data is at the moment, even though it will likely need further refinement.
#Now I want to find a way to identify the genres of these movies, as that will help me answer the specific
#question of what TYPES of movies should be made.

#After this, I plan on investigating review scores, giving me Financials, Review Scores, and Genres to work with.

In [56]:
#Table7 - imdb basics dataset has genre information. I will investigate.
table7

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116,


In [57]:
table7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [58]:
table7.sort_values(by='start_year', ascending=True).head(100)
#This data only goes back to 2010. This may still be useful as Microsoft will want current information on
#which genres are popular.

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
9599,tt1566491,Brainiacs in La La Land,Brainiacs in La La Land,2010,,Comedy
43264,tt2578092,Fireplace for your Home: Crackling Fireplace w...,Fireplace for your Home: Crackling Fireplace w...,2010,61.0,Music
11550,tt1634300,Role/Play,Role/Play,2010,85.0,"Drama,Romance"
11551,tt1634332,Johan1,Johan Primero,2010,78.0,"Comedy,Drama,Romance"
11552,tt1634334,Hands Up,Les mains en l'air,2010,90.0,Drama
11553,tt1634337,Que devient mon souvenir quand tu n'y penses pas,Que devient mon souvenir quand tu n'y penses pas,2010,45.0,Documentary
11554,tt1634519,The Black Eyed Peas: The E.N.D. World Tour Live,The Black Eyed Peas: The E.N.D. World Tour Live,2010,,"Documentary,Music"
11555,tt1634524,Jitters,Órói,2010,93.0,"Drama,Romance"
11556,tt1634540,Rescue Men: The Story of the Pea Island Lifesa...,Rescue Men: The Story of the Pea Island Lifesa...,2010,90.0,Documentary
11557,tt1634554,Janakan,Janakan,2010,150.0,"Crime,Thriller"


In [None]:
#TMDB also has genre information, although I would need to find out what each genre code relates to.
#Let me see how far back this data goes...

#table2 = tables['tmdb_movies_csv_gz']
#table2

In [None]:
#table2.info()

In [None]:
#table2.sort_values(by='release_date', ascending=True).head(100)
#This dataset has genre ids for all of its movies. It starts in 1930 and ends in mid-2019.

In [59]:
#I am going to merge several of the IMDB datasets together, using the tconst as a key.
imdb_df = pd.merge(table1, table4, left_on= 'tconst', right_on= 'tconst')
imdb_df

Unnamed: 0,tconst,directors,writers,averagerating,numvotes
0,tt0285252,nm0899854,nm0899854,4,219
1,tt0462036,nm1940585,nm1940585,6,18
2,tt0835418,nm0151540,"nm0310087,nm0841532",5,8147
3,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943,6,875
4,tt0879859,nm2416460,,7,21
...,...,...,...,...,...
73851,tt8947660,nm10097606,nm10097614,8,21
73852,tt8948614,"nm0827830,nm0839064",,7,696
73853,tt8954732,nm0737517,"nm0076820,nm2997602",6,13993
73854,tt8991416,nm7731173,nm7731173,7,13


In [60]:
imdb_df= pd.merge(imdb_df, table7, left_on= 'tconst', right_on= 'tconst')
imdb_df

Unnamed: 0,tconst,directors,writers,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy
1,tt0462036,nm1940585,nm1940585,6,18,Steve Phoenix: The Untold Story,Steve Phoenix: The Untold Story,2012,110,Drama
2,tt0835418,nm0151540,"nm0310087,nm0841532",5,8147,The Babymakers,The Babymakers,2012,95,Comedy
3,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943,6,875,Bulletface,Bulletface,2010,82,Thriller
4,tt0879859,nm2416460,,7,21,Torn,Torn,2010,,Thriller
...,...,...,...,...,...,...,...,...,...,...
73851,tt8947660,nm10097606,nm10097614,8,21,Goyenda Tatar,Goyenda Tatar,2019,,Adventure
73852,tt8948614,"nm0827830,nm0839064",,7,696,Reversing Roe,Reversing Roe,2018,99,Documentary
73853,tt8954732,nm0737517,"nm0076820,nm2997602",6,13993,The Princess Switch,The Princess Switch,2018,101,Romance
73854,tt8991416,nm7731173,nm7731173,7,13,Doozy,Doozy,2018,70,"Animation,Comedy"


In [61]:
imdb_df= pd.merge(imdb_df, table11, left_on= 'tconst', right_on= 'tconst')
imdb_df

Unnamed: 0,tconst,directors,writers,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,ordering,nconst,category,job,characters
0,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy,10,nm1077681,composer,,
1,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy,1,nm0960950,actor,,"[""Darren Fields""]"
2,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy,2,nm0461311,actor,,"[""RJ""]"
3,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy,3,nm0000686,actor,,"[""Roy Callahan""]"
4,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy,4,nm0001822,actor,,"[""Tom Wald""]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
629750,tt9004986,nm4993825,nm4993825,8,7,Syndebukken: Prosessen mot Harry Lindstrøm,Syndebukken: Prosessen mot Harry Lindstrøm,2018,,Documentary,5,nm4993825,director,,
629751,tt9004986,nm4993825,nm4993825,8,7,Syndebukken: Prosessen mot Harry Lindstrøm,Syndebukken: Prosessen mot Harry Lindstrøm,2018,,Documentary,6,nm9008590,producer,producer,
629752,tt9004986,nm4993825,nm4993825,8,7,Syndebukken: Prosessen mot Harry Lindstrøm,Syndebukken: Prosessen mot Harry Lindstrøm,2018,,Documentary,7,nm1145570,composer,,
629753,tt9004986,nm4993825,nm4993825,8,7,Syndebukken: Prosessen mot Harry Lindstrøm,Syndebukken: Prosessen mot Harry Lindstrøm,2018,,Documentary,8,nm0461754,cinematographer,,


In [62]:
#Merging in the principals table gives every person attached to a movie their own row, so each movie is repeated.
#I doubt that I will need the personnel, so I keep only the first row per title.
#Note: this also drops distinct movies that happen to share the same original_title.

imdb_df = imdb_df.drop_duplicates('original_title', keep='first')
imdb_df


Unnamed: 0,tconst,directors,writers,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,ordering,nconst,category,job,characters
0,tt0285252,nm0899854,nm0899854,4,219,Life's a Beach,Life's a Beach,2012,100,Comedy,10,nm1077681,composer,,
10,tt0462036,nm1940585,nm1940585,6,18,Steve Phoenix: The Untold Story,Steve Phoenix: The Untold Story,2012,110,Drama,10,nm0230187,production_designer,,
20,tt0835418,nm0151540,"nm0310087,nm0841532",5,8147,The Babymakers,The Babymakers,2012,95,Comedy,10,nm0790481,composer,,
30,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943,6,875,Bulletface,Bulletface,2010,82,Thriller,10,nm0480523,producer,producer,
40,tt0879859,nm2416460,,7,21,Torn,Torn,2010,,Thriller,10,nm1269186,editor,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
629707,tt8947660,nm10097606,nm10097614,8,21,Goyenda Tatar,Goyenda Tatar,2019,,Adventure,10,nm10377635,composer,,
629717,tt8948614,"nm0827830,nm0839064",,7,696,Reversing Roe,Reversing Roe,2018,99,Documentary,10,nm0324930,editor,,
629727,tt8954732,nm0737517,"nm0076820,nm2997602",6,13993,The Princess Switch,The Princess Switch,2018,101,Romance,10,nm0294520,composer,,
629737,tt8991416,nm7731173,nm7731173,7,13,Doozy,Doozy,2018,70,"Animation,Comedy",1,nm0571897,actor,,"[""Clovis""]"
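
Dropping duplicates on `original_title` collapses the repeated crew rows, but it also collapses different films that share a title. A minimal sketch of the difference, assuming `tconst` uniquely identifies a film; all identifiers in the toy data are made up.

```python
import pandas as pd

# Two different films that share a title, each repeated once per crew member.
# All tconst/nconst values here are hypothetical.
rows = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002", "tt0000002"],
    "original_title": ["Home", "Home", "Home", "Home"],
    "start_year": [2015, 2015, 2016, 2016],
    "nconst": ["nm0000001", "nm0000002", "nm0000003", "nm0000004"],
})

# Deduplicating on the title keeps a single row for both films combined.
by_title = rows.drop_duplicates(subset="original_title", keep="first")

# Deduplicating on the film ID keeps one row per film.
by_id = rows.drop_duplicates(subset="tconst", keep="first")

print(len(by_title), len(by_id))
```

Keeping one row per `tconst` would leave same-titled films in the table, which then have to be told apart at merge time.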


In [63]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70944 entries, 0 to 629745
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tconst           70944 non-null  object 
 1   directors        70368 non-null  object 
 2   writers          60791 non-null  object 
 3   averagerating    70944 non-null  float64
 4   numvotes         70944 non-null  int64  
 5   primary_title    70944 non-null  object 
 6   original_title   70944 non-null  object 
 7   start_year       70944 non-null  int64  
 8   runtime_minutes  63654 non-null  float64
 9   genres           70246 non-null  object 
 10  ordering         70944 non-null  int64  
 11  nconst           70944 non-null  object 
 12  category         70944 non-null  object 
 13  job              7883 non-null   object 
 14  characters       26953 non-null  object 
dtypes: float64(2), int64(3), object(10)
memory usage: 8.7+ MB


In [65]:
imdb_df.sort_values('original_title').head(100)

Unnamed: 0,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres
3802,tt2346170,6,40,#1 Serial Killer,#1 Serial Killer,2013,87.0,Horror
272939,tt3120962,7,6,#5,#5,2013,68.0,"Biography,Comedy,Fantasy"
626374,tt5255986,5,18,#66,#66,2015,116.0,Action
618511,tt7853996,8,21,#ALLMYMOVIES,#ALLMYMOVIES,2015,,Documentary
584307,tt9844890,7,8,#AbroHilo,#AbroHilo,2019,52.0,Documentary
530073,tt6170868,7,23,#BKKY,#BKKY,2016,75.0,Drama
253747,tt5074174,9,31,#BeRobin the Movie,#BeRobin the Movie,2015,41.0,Documentary
451340,tt4353986,5,18,#Beings,#Beings,2015,56.0,Thriller
311045,tt6856592,3,212,#Captured,#Captured,2017,81.0,Thriller
463137,tt5803530,6,19,#DigitalLivesMatter,#DigitalLivesMatter,2016,,Comedy


In [None]:
#TODO: define imdb_df['imdb_title']

In [64]:
#dropped the columns that weren't useful. (can always get them back from the original sources)
# I dropped the following columns: directors, writers, ordering, nconst, category, job, characters
imdb_df = imdb_df.drop(['directors', 'writers', 'ordering', 'nconst', 'category', 'job', 'characters'], axis = 1)
imdb_df.head(10)

Unnamed: 0,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0285252,4,219,Life's a Beach,Life's a Beach,2012,100.0,Comedy
10,tt0462036,6,18,Steve Phoenix: The Untold Story,Steve Phoenix: The Untold Story,2012,110.0,Drama
20,tt0835418,5,8147,The Babymakers,The Babymakers,2012,95.0,Comedy
30,tt0878654,6,875,Bulletface,Bulletface,2010,82.0,Thriller
40,tt0879859,7,21,Torn,Torn,2010,,Thriller
50,tt0996958,2,495,Legend of the Red Reaper,Legend of the Red Reaper,2013,99.0,"Action,Adventure,Fantasy"
60,tt0999913,6,30924,Straw Dogs,Straw Dogs,2011,110.0,"Action,Drama,Thriller"
70,tt10011102,8,55,The Sholay Girl,The Sholay Girl,2019,106.0,"Action,Biography,Drama"
80,tt1002965,8,31,Call of Life,Call of Life,2010,60.0,Documentary
90,tt10055770,8,380,Vellai Pookal,Vellaipookal,2019,122.0,Thriller


In [None]:
#I want to now merge this imdb_df with my financial information from table9. I know that if I just do a merge
# on primary title, some of the results will be incorrect. (Avatar is the top grossing movie but is listed as a horror movie)


In [68]:
#Left join with the indicator column so I can track where everything is coming from.

merged_df = pd.merge(table9, imdb_df, left_on= 'movie', right_on= 'original_title', how='left', indicator=True)
merged_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,_merge
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009,2351345279,6,,,,,,,,,left_only
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011,635063875,2,tt1298650,7.0,447624.0,Pirates of the Caribbean: On Stranger Tides,Pirates of the Caribbean: On Stranger Tides,2011.0,136.0,"Action,Adventure,Fantasy",both
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,2019,-200237650,-1,tt6565702,6.0,24451.0,Dark Phoenix,Dark Phoenix,2019.0,113.0,"Action,Adventure,Sci-Fi",both
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,2015,1072413963,3,tt2395427,7.0,665594.0,Avengers: Age of Ultron,Avengers: Age of Ultron,2015.0,141.0,"Action,Adventure,Sci-Fi",both
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,2017,999721747,3,,,,,,,,,left_only
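
The `_merge` column added by `indicator=True` can be summarised directly to see how many financial records found an IMDB match. A small sketch on toy frames; the `financials` and `imdb` names are placeholders, not the notebook's tables.

```python
import pandas as pd

financials = pd.DataFrame({"movie": ["Avatar", "Dark Phoenix", "Furious 7"]})
imdb = pd.DataFrame({
    "original_title": ["Dark Phoenix", "Furious Seven"],
    "genres": ["Action,Adventure,Sci-Fi", "Action,Crime,Thriller"],
})

merged = pd.merge(financials, imdb, left_on="movie", right_on="original_title",
                  how="left", indicator=True)

# "both" = matched in IMDB, "left_only" = financial record with no IMDB match.
match_counts = merged["_merge"].value_counts()
unmatched_titles = merged.loc[merged["_merge"] == "left_only", "movie"].tolist()

print(match_counts)
print(unmatched_titles)
```

Listing the `left_only` titles is a quick way to spot naming differences such as "Furious 7" versus "Furious Seven".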


In [69]:
#5782 records after the left join.
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5782 entries, 0 to 5781
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   float64       
 4   domestic_gross     5782 non-null   float64       
 5   worldwide_gross    5782 non-null   float64       
 6   release_year       5782 non-null   int64         
 7   total_profit       5782 non-null   float64       
 8   ROI                5782 non-null   float64       
 9   tconst             2123 non-null   object        
 10  averagerating      2123 non-null   float64       
 11  numvotes           2123 non-null   float64       
 12  primary_title      2123 non-null   object        
 13  original_title     2123 non-null   object        
 14  start_ye

In [71]:
merged_df.sort_values('total_profit', ascending=False).head(500)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,_merge
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009,2351345279,6,,,,,,,,,left_only
42,43,1997-12-19,Titanic,200000000,659363944,2208208395,1997,2008208395,10,tt2495766,6.0,20.0,Titanic,Titanic,2012.0,,Adventure,both
6,7,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,2018,1748134200,6,tt4154756,8.0,670926.0,Avengers: Infinity War,Avengers: Infinity War,2018.0,149.0,"Action,Adventure,Sci-Fi",both
5,6,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,2015,1747311220,6,,,,,,,,,left_only
33,34,2015-06-12,Jurassic World,215000000,652270625,1648854864,2015,1433854864,7,tt0369610,7.0,539338.0,Jurassic World,Jurassic World,2015.0,124.0,"Action,Adventure,Sci-Fi",both
66,67,2015-04-03,Furious 7,190000000,353007020,1518722794,2015,1328722794,7,,,,,,,,,left_only
26,27,2012-05-04,The Avengers,225000000,623279547,1517935897,2012,1292935897,6,tt0848228,8.0,1183655.0,The Avengers,The Avengers,2012.0,143.0,"Action,Adventure,Sci-Fi",both
260,61,2011-07-15,Harry Potter and the Deathly Hallows: Part II,125000000,381193157,1341693157,2011,1216693157,10,,,,,,,,,left_only
41,42,2018-02-16,Black Panther,200000000,700059566,1348258224,2018,1148258224,6,tt1825683,7.0,516148.0,Black Panther,Black Panther,2018.0,134.0,"Action,Adventure,Sci-Fi",both
112,13,2018-06-22,Jurassic World: Fallen Kingdom,170000000,417719760,1305772799,2018,1135772799,7,tt4881806,6.0,219125.0,Jurassic World: Fallen Kingdom,Jurassic World: Fallen Kingdom,2018.0,128.0,"Action,Adventure,Sci-Fi",both


In [73]:
#Checking to see what happens if I merge on primary_title
merged_df_2 = pd.merge(table9, imdb_df, left_on= 'movie', right_on= 'primary_title', how='left', indicator=True)
merged_df_2.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,_merge
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009,2351345279,6,tt1775309,6.0,43.0,Avatar,Abatâ,2011.0,93.0,Horror,both
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011,635063875,2,tt1298650,7.0,447624.0,Pirates of the Caribbean: On Stranger Tides,Pirates of the Caribbean: On Stranger Tides,2011.0,136.0,"Action,Adventure,Fantasy",both
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,2019,-200237650,-1,tt6565702,6.0,24451.0,Dark Phoenix,Dark Phoenix,2019.0,113.0,"Action,Adventure,Sci-Fi",both
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,2015,1072413963,3,tt2395427,7.0,665594.0,Avengers: Age of Ultron,Avengers: Age of Ultron,2015.0,141.0,"Action,Adventure,Sci-Fi",both
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,2017,999721747,3,,,,,,,,,left_only


In [75]:
#5978 records if I join this way.
merged_df_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5978 entries, 0 to 5977
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5978 non-null   int64         
 1   release_date       5978 non-null   datetime64[ns]
 2   movie              5978 non-null   object        
 3   production_budget  5978 non-null   float64       
 4   domestic_gross     5978 non-null   float64       
 5   worldwide_gross    5978 non-null   float64       
 6   release_year       5978 non-null   int64         
 7   total_profit       5978 non-null   float64       
 8   ROI                5978 non-null   float64       
 9   tconst             2369 non-null   object        
 10  averagerating      2369 non-null   float64       
 11  numvotes           2369 non-null   float64       
 12  primary_title      2369 non-null   object        
 13  original_title     2369 non-null   object        
 14  start_ye

In [77]:
duplicates = merged_df_2[merged_df_2['movie'].duplicated()]
print(len(duplicates))
duplicates.head()

280


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,_merge
135,35,2017-03-17,Beauty and the Beast,160000000,504014165,1259199706,2017,1099199706,7,tt2771200,7,238325,Beauty and the Beast,Beauty and the Beast,2017,129,"Family,Fantasy,Musical",both
157,56,2013-11-22,Frozen,150000000,400738009,1272469910,2013,1122469910,7,tt1611845,5,75,Frozen,Wai nei chung ching,2010,92,"Fantasy,Romance",both
246,44,2015-03-27,Home,130000000,177397510,385997896,2015,255997896,2,tt4047846,7,811,Home,Home,2016,103,Drama,both
247,44,2015-03-27,Home,130000000,177397510,385997896,2015,255997896,2,tt2075392,6,96,Home,Yurt,2011,76,Drama,both
248,44,2015-03-27,Home,130000000,177397510,385997896,2015,255997896,2,tt2372760,7,306,Home,Hemma,2013,90,"Drama,Romance",both
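
The duplicates above come from matching on title alone, so one financial record pairs with every IMDB film of the same name. One possible way to reduce these false matches is to merge on title and year together. This is a sketch of that idea rather than the approach this notebook settles on; the IMDB rows are copied from the outputs above.

```python
import pandas as pd

financials = pd.DataFrame({
    "movie": ["Home", "Frozen", "Dark Phoenix"],
    "release_year": [2015, 2013, 2019],
})
imdb = pd.DataFrame({
    "tconst": ["tt4047846", "tt2075392", "tt2372760", "tt1611845", "tt6565702"],
    "primary_title": ["Home", "Home", "Home", "Frozen", "Dark Phoenix"],
    "start_year": [2016, 2011, 2013, 2010, 2019],
})

# Title-only merge: "Home" picks up three unrelated films.
loose = pd.merge(financials, imdb, left_on="movie", right_on="primary_title", how="left")

# Title-and-year merge: only rows where both the title and the year agree are matched.
strict = pd.merge(
    financials, imdb,
    left_on=["movie", "release_year"],
    right_on=["primary_title", "start_year"],
    how="left", indicator=True,
)

print(len(loose), len(strict))
print(strict[["movie", "tconst", "_merge"]])
```

The trade-off is that an exact year match can reject true matches when the two sources record release years that differ by one.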


In [74]:
merged_df_2.sort_values('total_profit', ascending=False).head(100)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,total_profit,ROI,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres,_merge
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2009,2351345279,6,tt1775309,6.0,43.0,Avatar,Abatâ,2011.0,93.0,Horror,both
42,43,1997-12-19,Titanic,200000000,659363944,2208208395,1997,2008208395,10,tt2495766,6.0,20.0,Titanic,Titanic,2012.0,,Adventure,both
6,7,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,2018,1748134200,6,tt4154756,8.0,670926.0,Avengers: Infinity War,Avengers: Infinity War,2018.0,149.0,"Action,Adventure,Sci-Fi",both
5,6,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,2015,1747311220,6,,,,,,,,,left_only
33,34,2015-06-12,Jurassic World,215000000,652270625,1648854864,2015,1433854864,7,tt0369610,7.0,539338.0,Jurassic World,Jurassic World,2015.0,124.0,"Action,Adventure,Sci-Fi",both
66,67,2015-04-03,Furious 7,190000000,353007020,1518722794,2015,1328722794,7,tt2820852,7.0,335074.0,Furious 7,Furious Seven,2015.0,137.0,"Action,Crime,Thriller",both
26,27,2012-05-04,The Avengers,225000000,623279547,1517935897,2012,1292935897,6,tt0848228,8.0,1183655.0,The Avengers,The Avengers,2012.0,143.0,"Action,Adventure,Sci-Fi",both
265,61,2011-07-15,Harry Potter and the Deathly Hallows: Part II,125000000,381193157,1341693157,2011,1216693157,10,,,,,,,,,left_only
41,42,2018-02-16,Black Panther,200000000,700059566,1348258224,2018,1148258224,6,tt1825683,7.0,516148.0,Black Panther,Black Panther,2018.0,134.0,"Action,Adventure,Sci-Fi",both
112,13,2018-06-22,Jurassic World: Fallen Kingdom,170000000,417719760,1305772799,2018,1135772799,7,tt4881806,6.0,219125.0,Jurassic World: Fallen Kingdom,Jurassic World: Fallen Kingdom,2018.0,128.0,"Action,Adventure,Sci-Fi",both


In [None]:
movie_title_df = imdb_df[['tconst' , 'primary_title' , 'original_title', 'start_year', 'runtime_minutes']]
movie_title_df

In [None]:
#imdb_df['o_tconst'] = imdb_df['tconst']
#imdb_df

In [None]:
#imdb_title1 = imdb_df[['primary_title' , 'p_tconst']]
#imdb_title2 = imdb_df[['original_title', 'o_tconst']]



In [None]:
#imdb_title = [imdb_title1, imdb_title2]
#imdb_title

In [None]:
#master_title_df_1= pd.merge(imdb_title1, table9_name, left_on='primary_title', right_on='movie', how='left' )
#master_title_df_1
                        

In [None]:
#master_title_df_1.sort_values('movie', ascending=True).head()

In [None]:
#master_title_df_2 = pd.merge(imdb_title2, table9_name, left_on='original_title', right_on='movie', how='left' )
#master_title_df_2

In [None]:
#master_title_df_2.sort_values('movie').head(20)

In [None]:
#master_title_df = pd.merge(master_title_df_1, master_title_df_2, left_on='primary_title', right_on='original_title', how='outer' )
#master_title_df

In [None]:
#master_title_df.sort_values('movie_x', ascending=True).head(1000)

In [None]:
#table9_name = table9[['movie', 'id']]
#table9_name

In [None]:
#table9.sort_values('total_profit', ascending=False).head(100)

In [None]:
table9_titles = table9[['id' , 'movie' , 'release_year', 'total_profit', 'ROI', 'release_date']]
table9_titles

In [None]:
#doing an inner join gives me 2,369 records

#merged_df = pd.merge(movie_title_df, table9_titles, left_on= 'primary_title', right_on= 'movie', how='inner')
#merged_df

#table9_titles
#movie_title_df

#df4 =pd.merge(df2, df3, left_on= 'tconst', right_on= 'tconst')
#df4

In [None]:
#an outer join gives me 74,630 rows
#merged_df_outer = pd.merge(movie_title_df, table9_titles, left_on= 'primary_title', right_on= 'movie', how='outer')
#merged_df_outer

In [None]:
#merged_df_outer.sort_values('total_profit', ascending=False).head(1000)

In [None]:
#The start_year and release_year values are no longer in datetime format. I need to change the null values
#to an appropriate value so that I can correctly format the years.

#merged_df_outer['start_year'] = pd.to_datetime(merged_df_outer['start_year'], errors='coerce')
#merged_df_outer

In [None]:
#I want to keep all of the data from table9 as the financial data is the most important to me.
#I will do right join and see if that gives me what I want.
merged_df_2 = pd.merge(movie_title_df, table9_titles, left_on= 'primary_title', right_on= 'movie', how='right')
merged_df_2

In [None]:
merged_df_2.info()

In [None]:
merged_df_2.sort_values('total_profit', ascending=False).head(100)

In [None]:
#merged_df_22 = pd.merge(merged_df_2, master_title_df_2, on= 'movie', how='outer')
#merged_df_22

In [None]:
#cleaned_df_22['p_tconst'] = merged_df_22['tconst']

In [None]:
#cleaned_df_22.sort_values('total_profit', ascending=False).head(10)

In [None]:
#dropping all the repeated or unnecessary columns.
#cleaned_df_22 = cleaned_df_22.drop('start_year', axis=1)
#cleaned_df_22.head(10)

In [None]:
#renaming some columns
#cleaned_df_22['id_or'] = cleaned_df_22['id_y']
#cleaned_df_22.head()

In [None]:
#cleaned_df_22.sort_values('total_profit', ascending=False).head(100)

In [None]:
#cleaned_df_22.info()

In [None]:
#master_df = cleaned_df_22
#master_df

In [None]:
#master_df_2 = master_df[['total_profit','movie', 'id_p', 'id_or', 'original_title_or','o_tconst', 'original_title_p', 'p_tconst']]
#master_df_2.sort_values('total_profit', ascending=False).head(100)

In [None]:
#movie_duplicates = master_df_3['movie'].duplicated()
#movie_duplicates.head(10)

In [None]:
merged_df_2

In [None]:
merged_df_outer.info()

In [None]:
#what if I joined on original title rather than primary title?

merged_df_3 = pd.merge(movie_title_df, table9_titles, left_on= 'original_title', right_on= 'movie', how='right')
merged_df_3

In [None]:
merged_df_3.info()

In [None]:
merged_df_3.sort_values('total_profit', ascending=False).head(100)

In [None]:
#Exploring some more datasets to see which ones I want to use.
merged_rt_df = pd.merge(table6, table8, on= 'id', how='outer')
merged_rt_df

In [None]:
#I can't find a movie title column anywhere in the RT data.
#With an API I could possibly look the titles up.
merged_rt_df.info()

In [None]:
merged_rt_df_2 = merged_rt_df

In [None]:
#merged_rt_df['runtime'].isna().sum()

In [None]:
#corrected_runtime = merged_rt_df_2['runtime'].fillna(0)

In [None]:
#corrected_runtime

In [None]:
#corrected_runtime_2 = corrected_runtime.str[:3]
#corrected_runtime_2

In [None]:
#corrected_runtime_2.isna().sum()

In [None]:
#corrected_runtime_3 = corrected_runtime_2.fillna(0)

In [None]:
#corrected_runtime_3.value_counts()

In [None]:
#merged_rt_df_2['runtime'] = corrected_runtime_3

In [None]:
#merged_rt_df_2.info()

In [None]:
#merged_rt_df_2 = merged_rt_df_2.drop('box_office', axis=1)
#merged_rt_df_2.head(10)

In [None]:
#merged_rt_df_2['runtime'].isna().sum()



In [None]:
#dropping null values so I can convert the remaining ones to float.
#merged_rt_df_2['runtime'].dropna()

In [None]:
#cleaned_runtime = merged_rt_df_2['runtime'].str[:3]
#cleaned_runtime.tail(5)
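
Slicing the first three characters only works when every runtime has three digits. A hedged alternative is to extract the leading digits with a regular expression. The sketch assumes the raw values look like `"104 minutes"`, which is a guess about the format, and uses made-up values.

```python
import pandas as pd

# Hypothetical runtime strings; the "<number> minutes" format is an assumption.
runtime = pd.Series(["104 minutes", "95 minutes", None, "200 minutes"])

# Pull out the first run of digits, then convert to numbers; missing values stay NaN.
minutes = pd.to_numeric(runtime.str.extract(r"(\d+)", expand=False), errors="coerce")

print(minutes)
```

This avoids filling missing runtimes with 0, which would otherwise look like real zero-minute films in later comparisons.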

In [None]:
#cleaned_theater_date = pd.to_datetime(merged_rt_df_2['theater_date'])
#cleaned_theater_date

In [None]:
#merged_rt_df_2["theater_date"] = cleaned_theater_date
#merged_rt_df_2

In [None]:
df = merged_rt_df_2

In [None]:
duplicates = df[df.duplicated()]
print(len(duplicates))
duplicates.head()


In [None]:
merged_with_rt_2 = pd.merge(merged_df_2, merged_rt_df_2,  how='left', right_on=['theater_date','runtime'], left_on = ['release_date','runtime_minutes'])

In [None]:
merged_with_rt_2.sort_values('theater_date', ascending=True).head(100)

In [None]:
merged_with_rt = pd.merge(merged_df_2, df, left_on= 'release_date', right_on= 'theater_date', how='right')
merged_with_rt

In [None]:
merged_with_rt.head(200)

In [None]:
merged_df_2.info()

In [None]:
rt_duplicates = merged_rt_df_2[merged_rt_df_2['id'].duplicated()]



In [None]:
merged_rt_df_2.info()

In [None]:
merged_df_2.info()

In [None]:
merged_rt_df['runtime'][299]

In [None]:
merged_rt_df.sort_values('theater_date', ascending=False).head(100)

In [None]:
merged_rt_df

In [None]:
#Looking for some of the titles that aren't matching in my merged df
merged_df_4 = pd.merge(movie_title_df, table9_titles, left_on= 'original_title', right_on= 'movie', how='left')
merged_df_4

In [None]:
merged_df_4.sort_values('total_profit', ascending=False).head(100)

In [None]:
#Keep only the movies released in 2010 or later, since the IMDB data only goes back to 2010.
merged_df_5 = merged_df_3[merged_df_3['release_year'] >= 2010]

merged_df_5
        

In [None]:
merged_df_3

In [None]:
#attempting to search for some of the missing values


In [None]:
merged_df_4['primary_title']

In [None]:
merged_df_4['primary_title'].str.contains('star', regex=False).head(100)

In [None]:
#I am going to attempt a join to see what the resulting dataframe would look like.
test_join_outer = table9.join(table7, how='inner')
test_join_outer
#That didn't work! join() aligns on the row index, and the row numbers of the two tables are unrelated.
#Let me set the movie title as the index for each and try again.
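
`DataFrame.join` aligns rows on the index, which is why the first attempt paired unrelated movies. The toy example below shows the difference between joining on the default row numbers and joining after setting the title as the index. The frames are illustrative, with rounded ROI values taken from the tables above.

```python
import pandas as pd

financials = pd.DataFrame({"movie": ["Avatar", "Dark Phoenix"], "ROI": [6, -1]})
imdb = pd.DataFrame({
    "primary_title": ["Dark Phoenix", "Sunghursh"],
    "genres": ["Action,Adventure,Sci-Fi", "Action,Crime,Drama"],
})

# With the default RangeIndex, row 0 is paired with row 0 regardless of the titles.
by_position = financials.join(imdb, how="inner")

# With the title as the index on both sides, join() aligns on the titles instead.
by_title = financials.set_index("movie").join(imdb.set_index("primary_title"), how="inner")

print(by_position)
print(by_title)
```

`pd.merge` with `left_on` and `right_on` gives the same title-based result without changing either index.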

In [None]:
table9.set_index('movie', inplace=True)
table9

In [None]:
table7.set_index('primary_title', inplace=True)
table7

In [None]:
test_join_outer_2 = table9.join(table7, how='outer')
test_join_outer_2

In [None]:
test_join_outer_2.info()

In [None]:
#rename() returns a new dataframe, so the result has to be assigned back.
test_join_outer_2 = test_join_outer_2.rename(columns={test_join_outer_2.columns[1]: 'new'})
test_join_outer_2


In [None]:
test_join_outer_2.sort_values('original_title', ascending=True).head(100)
test_join_outer_2

In [None]:
test_join_inner_2 = table9.join(table7, how='left')
test_join_inner_2

In [None]:
df1 = test_join_inner_2
df1.head(100)

In [None]:
df1.sort_values(by='worldwide_gross', ascending=False).head(100)

In [None]:
#Problems to address: Lots of duplicates. Avatar merged with a horror movie. 
#start_year needs to be changed to datetime. not a lot of genre information.

In [None]:
#Plan of attack: Clean up the data that I've got. Attempt to merge in table2.

In [None]:
start_year = df1['start_year']
start_year

In [None]:
cleaned_start_year = start_year.astype(str)
cleaned_start_year

In [None]:
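#rstrip('.0') strips every trailing '.' and '0' (turning '2010.0' into '201'), so remove only the literal '.0' suffix.
cleaned_start_year_2 = cleaned_start_year.str.replace(r'\.0$', '', regex=True)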
cleaned_start_year_2

In [None]:
cleaned_start_year_2.astype(str)

In [None]:
df1['start_year']=cleaned_start_year_2
df1
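
The year column became float because the join introduced missing values. As an alternative to round-tripping through strings, pandas' nullable integer dtype can hold whole-number years alongside missing values. A small sketch on made-up values that also shows why `rstrip('.0')` is risky.

```python
import pandas as pd

start_year = pd.Series([2010.0, 2019.0, None, 2000.0])

# Pitfall: rstrip('.0') removes any trailing '.' or '0' characters,
# not just the literal suffix ".0".
stripped = start_year.astype(str).str.rstrip(".0")

# Alternative: the nullable "Int64" dtype keeps missing years as <NA>
# and needs no string handling.
as_int = start_year.astype("Int64")

print(stripped.tolist())
print(as_int.tolist())
```

With `Int64` the column stays numeric, so filters such as "2010 or later" keep working.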

In [None]:
#I am going to merge several of the IMDB datasets together, using the tconst as a key.
#imdb_df = pd.merge(table1, table4, left_on= 'tconst', right_on= 'tconst')
#df2

In [None]:
#df3 = pd.merge(table7, table11, left_on= 'tconst', right_on= 'tconst')
#df3

In [None]:
#df4 =pd.merge(df2, df3, left_on= 'tconst', right_on= 'tconst')
#df4


In [None]:
#df4 = df4.drop_duplicates('original_title', keep='first')

In [None]:
#df4.sort_values('original_title', ascending=True).head(100)

In [None]:
#dropped the columns that weren't useful. (can always get them back from the original sources)
#df4 = df4.drop('nconst', axis=1)
#df4.head(10)

In [None]:
#df4.info()

In [None]:
#exploring table10 to see if it contains any box office data that table9 does not contain.
table10.describe()

In [None]:
table10.info()

In [None]:
pizza = table10['foreign_gross']
pizza


In [None]:
table10['foreign_gross2'] = pd.to_numeric(table10['foreign_gross'], errors='coerce')
table10

In [None]:
#Replace the original text column with its numeric version, then drop the helper column.
table10['foreign_gross'] = table10['foreign_gross2']
table10.drop(["foreign_gross2"], axis=1, inplace=True)
table10

In [None]:
total_gross = table10['domestic_gross'] + table10['foreign_gross']
total_gross

In [None]:
table10['total_gross'] = total_gross

In [None]:
table10.sort_values('total_gross', ascending=False).head(100)
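
`pd.to_numeric(..., errors='coerce')` silently turns anything it cannot parse into NaN, so it is worth checking how many values are lost. The sketch below assumes some foreign gross values contain thousands separators, which is an assumption about the raw column, and uses hypothetical values.

```python
import pandas as pd

# Hypothetical values; the comma-formatted entry is an assumption about the raw column.
foreign_gross = pd.Series(["652000000", "1,131.6", None])

# Coercing directly loses the comma-formatted value as well as the missing one.
coerced = pd.to_numeric(foreign_gross, errors="coerce")

# Removing the separators first keeps it.
cleaned = pd.to_numeric(foreign_gross.str.replace(",", "", regex=False), errors="coerce")

print(coerced.isna().sum(), cleaned.isna().sum())
```

Any row where either gross figure is NaN will also have a NaN total, so the number of usable totals depends on this cleaning step.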

In [None]:
merged_df_6 = pd.merge(merged_df_3, table10, left_on= 'movie', right_on= 'title', how='left')
merged_df_6

In [None]:
merged_df_6.sort_values('total_profit', ascending=False).head(5)

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

### Data Analysis

Questions that I want to answer:

1. What are the genres of the movies that made the most profit?
2. What are the genres of the movies with the best ROI?
3. How much money do you need to spend to make these movies?
4. Maybe look at directors/studios/etc.?
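
The first two questions could be approached by splitting the comma-separated `genres` string into one row per genre and then aggregating. This is a minimal sketch on three rows taken from the merged output earlier in the notebook, with ROI rounded as displayed there. Using the median as the summary statistic is an assumption, not a decision made in this notebook.

```python
import pandas as pd

movies = pd.DataFrame({
    "movie": ["Dark Phoenix", "Black Panther",
              "Pirates of the Caribbean: On Stranger Tides"],
    "genres": ["Action,Adventure,Sci-Fi", "Action,Adventure,Sci-Fi",
               "Action,Adventure,Fantasy"],
    "total_profit": [-200_237_650, 1_148_258_224, 635_063_875],
    "ROI": [-1, 6, 2],
})

# One row per (movie, genre), so a multi-genre film counts toward each of its genres.
per_genre = movies.assign(genre=movies["genres"].str.split(",")).explode("genre")

# Median profit and ROI per genre, most profitable genres first.
summary = (
    per_genre.groupby("genre")[["total_profit", "ROI"]]
    .median()
    .sort_values("total_profit", ascending=False)
)

print(summary)
```

The median is less sensitive than the mean to outliers such as the very high ROI values seen for low-budget films.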

## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***