# Top Earners in the Movie Industry

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> I chose the IMDB movie dataset because I wanted to know how much different movie genres, directors, and production companies have grossed over time.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('imdb-movies.csv')

In [3]:
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


### Data Cleaning

In [4]:
# Drop columns that are not needed for the revenue analysis:
# homepage, cast, tagline, overview, runtime, vote_count, vote_average, keywords
df = df.drop(['homepage','cast','tagline','overview','runtime','vote_count','vote_average','keywords'], axis=1)

In [5]:
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,director,genres,production_companies,release_date,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2015,174799900.0,1385749000.0


In [11]:
# Check which columns in the dataset contain NaN values
df.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
director                  44
genres                    23
production_companies    1030
release_date               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [12]:
# Drop rows that have NaN in any of the affected columns, one column at a time
df.dropna(subset=['imdb_id'], inplace=True)

In [15]:
df.dropna(subset=['director'], inplace=True)

In [17]:
df.dropna(subset=['genres'], inplace=True)

In [19]:
df.dropna(subset=['production_companies'], inplace=True)
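The four `dropna` calls above could likely be collapsed into one, since `subset` accepts a list of column names. A minimal sketch on a made-up toy frame (the `toy`, `required`, and `cleaned` names and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the movie table; the values are made up for illustration.
toy = pd.DataFrame({
    'imdb_id': ['tt1', None, 'tt3', 'tt4'],
    'director': ['A', 'B', None, 'D'],
    'genres': ['Action', 'Drama', 'Comedy', 'Action'],
    'production_companies': ['X', 'Y', 'Z', None],
})

# Columns that must be present for a record to be kept.
required = ['imdb_id', 'director', 'genres', 'production_companies']

# Drop every row that has NaN in at least one of the required columns.
cleaned = toy.dropna(subset=required)
print(cleaned)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison, which can help when checking how many rows each step removes.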

In [20]:
df.isnull().sum()

id                      0
imdb_id                 0
popularity              0
budget                  0
revenue                 0
original_title          0
director                0
genres                  0
production_companies    0
release_date            0
release_year            0
budget_adj              0
revenue_adj             0
dtype: int64
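The null check reports no missing values for `budget` or `revenue`, but that does not guarantee every record has usable financial data. If the dataset encodes unknown amounts as 0 (an assumption worth verifying, for example with `df.describe()`), a filter along these lines could remove such records. The toy frame and the `has_financials` name are illustrative only:

```python
import pandas as pd

# Toy stand-in; the numbers are made up for illustration.
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'budget': [100, 0, 50],
    'revenue': [300, 20, 0],
})

# Assumption: a 0 in budget or revenue means "unknown", not "free" or "no income".
has_financials = toy[(toy['budget'] > 0) & (toy['revenue'] > 0)]
print(has_financials)
```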

Many movies list several `production_companies` and several `genres`, and the two lists rarely have the same length. Creating one record per production company and one record per genre in the same dataframe would therefore not line up, so I'll create 2 dataframes: one without a `genres` column (for production companies) and one without a `production_companies` column (for genres). In each, only the first listed value is kept.
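The length mismatch can be seen on a toy frame. The sketch below also shows an alternative to keeping only the first value: `str.split` followed by `DataFrame.explode` produces one row per listed value. The helper `one_row_per_value` and the toy data are hypothetical, not part of the notebook:

```python
import pandas as pd

# Toy stand-in; titles, studios, and revenues are made up.
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film B'],
    'revenue': [100, 40],
    'genres': ['Action|Thriller', 'Drama'],
    'production_companies': ['Studio X|Studio Y|Studio Z', 'Studio X'],
})

def one_row_per_value(frame, column, sep='|'):
    """Return a copy with `column` split on `sep` and one row per value."""
    out = frame.copy()
    out[column] = out[column].str.split(sep)
    return out.explode(column).reset_index(drop=True)

# Film A has 3 companies but only 2 genres, so the two expansions differ in length.
by_company = one_row_per_value(toy.drop(columns='genres'), 'production_companies')
by_genre = one_row_per_value(toy.drop(columns='production_companies'), 'genres')
print(len(by_company), len(by_genre))
```

Note that exploding repeats a movie's full revenue on every row, so sums per company or per genre credit the whole amount to each listed value.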

#### One `production_companies` per record

In [40]:
df_prod = df.drop(['genres'],axis=1)

In [36]:
df_prod[['production_companies','prod1','prod2','prod3','prod4']] = df_prod['production_companies'].str.split('|',expand=True)

In [38]:
df_prod = df_prod.drop(['prod1','prod2','prod3','prod4'],axis=1)

In [41]:
df_prod.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,director,production_companies,release_date,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Universal Studios,6/9/15,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Village Roadshow Pictures,5/13/15,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Summit Entertainment,3/18/15,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Lucasfilm,12/15/15,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,James Wan,Universal Pictures,4/1/15,2015,174799900.0,1385749000.0
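Assigning the result of `str.split('|', expand=True)` to exactly five named columns only works when the longest list in the data has exactly five entries. A sketch of a variant that keeps the first company regardless of how many are listed, shown on made-up values:

```python
import pandas as pd

# Toy stand-in with 3, 1, and 6 companies per record; the names are made up.
toy = pd.DataFrame({
    'production_companies': ['Studio X|Studio Y|Studio Z', 'Studio X', 'A|B|C|D|E|F'],
})

# Split into lists, then take element 0 of each list.
toy['production_companies'] = toy['production_companies'].str.split('|').str[0]
print(toy)
```

This avoids creating and then dropping the helper columns, at the cost of the same simplification: every movie is attributed to its first listed company only.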


#### One `genres` per record

In [1]:
# GENRES
# For each record, split the `genres` string on '|' and keep only the first genre.
# This way we should be able to query by a single genre.

In [25]:
df_genres = df.drop(['production_companies'],axis=1)

In [44]:
df_genres[['genres','g1','g2','g3','g4']] = df_genres['genres'].str.split('|',expand=True)

In [47]:
df_genres = df_genres.drop(['g1','g2','g3','g4'],axis=1)

In [48]:
df_genres.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,director,genres,release_date,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action,6/9/15,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action,5/13/15,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure,3/18/15,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action,12/15/15,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,James Wan,Action,4/1/15,2015,174799900.0,1385749000.0

<a id='eda'></a>
## Exploratory Data Analysis

### Which production companies released the most movies from 2012 onward? Display the top 10 production companies.

In [64]:
new = df.query('release_year > 2011')


In [68]:
new['production_companies'].value_counts().head(10)

Universal Pictures                        48
Columbia Pictures                         35
Paramount Pictures                        35
Walt Disney Pictures                      28
The Asylum                                25
BBC Films                                 23
Twentieth Century Fox Film Corporation    22
New Line Cinema                           19
Lionsgate                                 18
Summit Entertainment                      18
Name: production_companies, dtype: int64
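`value_counts` ranks companies by number of releases, which is not the same as ranking them by revenue. A minimal sketch of both rankings side by side, on a toy frame with made-up studios and amounts and one company per record (the `recent`, `release_counts`, and `revenue_totals` names are illustrative):

```python
import pandas as pd

# Toy stand-in; studios, years, and revenues are made up.
toy = pd.DataFrame({
    'release_year': [2010, 2013, 2014, 2015],
    'production_companies': ['Studio Y', 'Studio X', 'Studio X', 'Studio Y'],
    'revenue': [500, 100, 50, 900],
})

# Same year filter as the notebook cell above.
recent = toy.query('release_year > 2011')

# Ranking by number of releases.
release_counts = recent['production_companies'].value_counts()

# Ranking by total revenue.
revenue_totals = (
    recent.groupby('production_companies')['revenue']
    .sum()
    .sort_values(ascending=False)
)

print(release_counts)
print(revenue_totals)
```

In this toy data Studio X leads by releases while Studio Y leads by revenue, so the two questions can have different answers.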

### Which 5 movie genres have grossed the most of all time?

In [59]:
df_genres.groupby('genres').revenue.sum().sort_values(ascending = False).head()

genres
Action       96487224972
Adventure    73040959448
Comedy       67894064795
Drama        61581731996
Animation    28748603451
Name: revenue, dtype: int64
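The dataset also has a `revenue_adj` column, which judging by its name holds inflation-adjusted revenue (an assumption, since the notebook does not document it). For an all-time comparison it may be worth running the same ranking on both columns. A sketch with a hypothetical helper `top_grossing` and made-up numbers:

```python
import pandas as pd

# Toy stand-in; the amounts are made up so that the two rankings differ.
toy = pd.DataFrame({
    'genres': ['Action', 'Drama', 'Action'],
    'revenue': [100, 150, 200],
    'revenue_adj': [110.0, 600.0, 210.0],
})

def top_grossing(frame, key, value='revenue', n=5):
    """Sum `value` per `key` and return the n largest totals, highest first."""
    return frame.groupby(key)[value].sum().sort_values(ascending=False).head(n)

nominal = top_grossing(toy, 'genres')
adjusted = top_grossing(toy, 'genres', value='revenue_adj')
print(nominal)
print(adjusted)
```

The same helper would work for the director and title rankings by changing `key`.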

### Who are the top 10 grossing directors?

In [60]:
df.groupby('director').revenue.sum().sort_values(ascending = False).head(10)

director
Steven Spielberg     9018563772
Peter Jackson        6523244659
James Cameron        5841894863
Michael Bay          4917208171
Christopher Nolan    4167548502
David Yates          4154295625
Robert Zemeckis      3869690869
Chris Columbus       3851491668
Tim Burton           3665414624
Ridley Scott         3649996480
Name: revenue, dtype: int64

### Compare the revenue of the highest-grossing movies of all time.

In [61]:
df.groupby('original_title').revenue.sum().sort_values(ascending = False).head()

original_title
Avatar                          2781505847
Star Wars: The Force Awakens    2068178225
Titanic                         1845034188
The Avengers                    1568080742
Jurassic World                  1513528810
Name: revenue, dtype: int64
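A bar chart makes the comparison easier to read than the raw numbers. The sketch below plots the five totals copied from the output above; the axis label, title, and figure size are illustrative choices:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Revenue figures copied from the output above.
top5 = pd.Series({
    'Avatar': 2781505847,
    'Star Wars: The Force Awakens': 2068178225,
    'Titanic': 1845034188,
    'The Avengers': 1568080742,
    'Jurassic World': 1513528810,
}, name='revenue')

fig, ax = plt.subplots(figsize=(8, 4))

# Sort ascending so the largest bar ends up at the top of the horizontal chart.
(top5.sort_values() / 1e9).plot.barh(ax=ax)

ax.set_xlabel('Revenue (billions)')
ax.set_title('Highest-grossing movies in the dataset')
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, `plt.show()` or `fig.savefig(...)` would be needed.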

<a id='conclusions'></a>
## Conclusions

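* Avatar is the highest-grossing movie in the dataset.

* Steven Spielberg is the highest-grossing director in the dataset.

* Action movies (not to my surprise) are the highest-grossing genre, counting each movie under its first listed genre.

* Universal Pictures released the most movies from 2012 onward, and Walt Disney Pictures ranks fourth. This ranking counts releases, not revenue, so it does not show which production companies grossed the most.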