# Top Earners in Movie Industry

## Table of Contents

<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

> This analysis project uses the IMDb movie data. When the analysis is complete, you should be able to identify the top 5 highest-grossing directors, the top 5 highest-grossing movie genres of all time, how the revenues of the highest-grossing movies compare, and which companies released the most movies.

> There are 10 columns that will not be needed for the analysis. Use pandas to drop these columns. HINT: Only the columns pertaining to revenue will be needed.

> To get you started, I've already placed the needed code for getting the packages and datafile that you will be using for the project. 

In [4]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [5]:
df = pd.read_csv('files/imdb-movies.csv')

### Drop columns without necessary information and remove all records with no financial information -- Pay close attention to columns that don't tell you anything about financial data

In [6]:
df.drop(df.query('budget == 0 or revenue == 0').index, inplace = True)
df.drop(['id','popularity','imdb_id','homepage','budget', 'runtime','tagline','keywords','vote_count','vote_average','release_date','budget_adj','overview'], axis=1, inplace=True)

In [7]:
df.dropna(inplace = True)
df.reset_index(drop = True, inplace = True)
df.head(5)

Unnamed: 0,revenue,original_title,cast,director,genres,production_companies,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015,1392446000.0
1,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015,348161300.0
2,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015,271619000.0
3,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015,1902723000.0
4,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015,1385749000.0
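The cleaning steps above can be sketched on a small stand-in table. This is a minimal illustration, not the notebook's data: the `toy` frame, its titles, and its values are made up, and only the column names mirror the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie data; titles and values are invented for illustration.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [100, 0, 50, 80],
    "revenue": [300, 200, 0, 120],
    "director": ["X", "Y", "Z", np.nan],
    "homepage": [None, None, None, None],
})

# Keep only rows that report both a budget and a revenue.
has_financials = toy.query("budget != 0 and revenue != 0")

# Drop the unneeded columns first, then drop rows that still have missing values.
cleaned = (
    has_financials
    .drop(columns=["homepage", "budget"])
    .dropna()
    .reset_index(drop=True)
)
print(cleaned)
```

The order matters here: `homepage` is empty for every row in the toy frame, so calling `dropna()` before dropping that column would remove all records.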


In [8]:
#Splitting production records

# Split the pipe-delimited production_companies string into a list,
# then give each company its own row. Every record is kept, including
# movies released under a single production company.
df_pc = df.assign(production_companies=df.production_companies.str.split('|')).explode('production_companies')

df_pc.drop('genres', axis=1, inplace=True)
df_pc.reset_index(drop=True,inplace=True)

df_pc.head()

Unnamed: 0,revenue,original_title,cast,director,production_companies,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Universal Studios,2015,1392446000.0
1,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Amblin Entertainment,2015,1392446000.0
2,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Legendary Pictures,2015,1392446000.0
3,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Fuji Television Network,2015,1392446000.0
4,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Dentsu,2015,1392446000.0
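One detail worth keeping in mind with pipe-delimited columns such as `production_companies`: pandas string methods like `str.contains` treat the pattern as a regular expression by default, and `|` is the regex alternation operator. The sketch below, using a made-up two-row Series, shows how the default and the literal match differ.

```python
import pandas as pd

# Hypothetical values; only the pipe-delimited layout mirrors the real column.
companies = pd.Series(["Studio X|Studio Y", "Studio Z"])

# As a regex, "|" means "empty pattern OR empty pattern", which matches any string.
regex_match = companies.str.contains("|")

# With regex=False the pipe is matched literally, so only multi-company rows match.
literal_match = companies.str.contains("|", regex=False)

print(regex_match.tolist())    # [True, True]
print(literal_match.tolist())  # [True, False]
```

For this analysis, keeping every row is the desired outcome, since movies released under a single company should still be counted; the literal form is only needed when the goal is to isolate multi-company records.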


In [15]:
#Splitting genres

# Split the pipe-delimited genres string into a list,
# then give each genre its own row.
df_g = df.assign(genres=df.genres.str.split('|')).explode('genres')

df_g.drop('production_companies', axis=1, inplace=True)
df_g.reset_index(drop=True,inplace=True)


In [16]:
df_g.head()

Unnamed: 0,revenue,original_title,cast,director,genres,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action,2015,1392446000.0
1,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Adventure,2015,1392446000.0
2,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Science Fiction,2015,1392446000.0
3,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Thriller,2015,1392446000.0
4,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,Action,2015,348161300.0


### Data Cleaning

In [None]:
# Delete all records with null, or empty values
# df.reset_index(drop = True, inplace = True)
# df.head(10)
df.dropna(inplace = True)
df

#### Here's a helpful hint from my own analysis when I ran this the first time. This may help shed light on what your data set should look like.

#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`<br>and tried to run calculations, it wouldn't work because for many records the numbers of `production_companies`<br>and `genres` aren't the same, so I'll create 2 dataframes: one w/o a `production_companies` column and one w/o a `genres` column

In [None]:
df_genres = df.drop('production_companies', axis=1, inplace=False)
df_production_companies = df.drop('genres', axis=1, inplace=False) 
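One way to read the hint above is sketched below with a single made-up film. Expanding both multi-valued columns in the same frame produces every genre/company pair, which inflates any sum over `revenue`; expanding one column per frame avoids that. The frame and its values are hypothetical.

```python
import pandas as pd

# One invented film with 2 genres and 3 production companies.
toy = pd.DataFrame({
    "original_title": ["Film A"],
    "genres": ["Action|Thriller"],
    "production_companies": ["Studio X|Studio Y|Studio Z"],
    "revenue": [100],
})

split = toy.assign(
    genres=toy["genres"].str.split("|"),
    production_companies=toy["production_companies"].str.split("|"),
)

# Exploding both columns in turn yields every genre/company combination.
both = split.explode("genres").explode("production_companies")
print(len(both))              # 6 rows for one film (2 genres x 3 companies)
print(both["revenue"].sum())  # 600, six times the film's revenue

# Exploding one column per frame keeps each frame tied to a single dimension.
by_genre = split.drop(columns="production_companies").explode("genres")
by_company = split.drop(columns="genres").explode("production_companies")
print(len(by_genre), len(by_company))  # 2 3
```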

<a id="eda"></a>
## Exploratory Data Analysis

> Use Matplotlib to display your data analysis

### Which production companies released the most movies in the last 10 years? Display the top 5 production companies.

In [17]:
last_10 = df_pc.query('release_year > 2007').production_companies.value_counts().nlargest(10)
last_10

Relativity Media                          90
Universal Pictures                        89
Warner Bros.                              87
Columbia Pictures                         73
Paramount Pictures                        60
Twentieth Century Fox Film Corporation    52
New Line Cinema                           45
Walt Disney Pictures                      41
Summit Entertainment                      38
Lionsgate                                 37
Name: production_companies, dtype: int64
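Since the section asks for Matplotlib output, the counts above could be charted along these lines. This is a sketch only: the five values are copied from the output above rather than recomputed, and the figure size and labels are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; inside the notebook, %matplotlib inline fills this role
import matplotlib.pyplot as plt
import pandas as pd

# Top 5 counts copied from the value_counts() output above.
top_5 = pd.Series({
    "Relativity Media": 90,
    "Universal Pictures": 89,
    "Warner Bros.": 87,
    "Columbia Pictures": 73,
    "Paramount Pictures": 60,
})

fig, ax = plt.subplots(figsize=(8, 4))
# Reverse the order so the largest bar is drawn at the top of the chart.
ax.barh(top_5.index[::-1], top_5.values[::-1])
ax.set_xlabel("Movies released (release_year > 2007)")
ax.set_title("Top 5 production companies by number of releases")
fig.tight_layout()
```

In the notebook itself, the Series could instead come from `last_10.head(5)`.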

### What 5 movie genres grossed the highest all-time?

In [18]:
high_gross = df_g.groupby('genres')['revenue'].sum().nlargest(5)
high_gross

genres
Action       169790679429
Adventure    163439947064
Comedy       132020867028
Drama        130323128041
Thriller     117587434010
Name: revenue, dtype: int64

### Who are the top 5 grossing directors?

In [21]:
highest_director = df_g.groupby('director')['revenue'].sum().nlargest(5)
highest_director

director
Steven Spielberg     24663086098
James Cameron        20132327500
Peter Jackson        17530037629
Michael Bay          16297866403
Christopher Nolan    16196885522
Name: revenue, dtype: int64
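Because `df_g` holds one row per genre, a film with several genres contributes its revenue to its director's total once per genre. The sketch below uses invented rows to show the effect and one possible remedy, de-duplicating to one row per film before grouping. The choice of `original_title` and `director` as the de-duplication key is an assumption; adding `release_year` to the key would guard against remakes that share a title and director.

```python
import pandas as pd

# Invented genre-level rows: Film A has three genres, Film B has one.
genre_rows = pd.DataFrame({
    "original_title": ["Film A", "Film A", "Film A", "Film B"],
    "director": ["Dir 1", "Dir 1", "Dir 1", "Dir 2"],
    "genres": ["Action", "Adventure", "Thriller", "Drama"],
    "revenue": [100, 100, 100, 250],
})

# Summing on genre-level rows counts Film A three times.
inflated = genre_rows.groupby("director")["revenue"].sum()

# Keep one row per film, then sum.
per_film = genre_rows.drop_duplicates(subset=["original_title", "director"])
actual = per_film.groupby("director")["revenue"].sum()

print(inflated.to_dict())  # {'Dir 1': 300, 'Dir 2': 250}
print(actual.to_dict())    # {'Dir 1': 100, 'Dir 2': 250}
```

In this toy case the leading director changes once each film is counted once, so the ranking above may be worth re-checking against the film-level `df`.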

### Compare the revenue of the highest grossing movies of all time.

In [22]:
highest_movie = df_g.groupby('original_title')['revenue'].sum().nlargest(5)
highest_movie

original_title
Avatar                          11126023388
Star Wars: The Force Awakens     8272712900
Jurassic World                   6054115240
Titanic                          5535102564
The Net                          5531398290
Name: revenue, dtype: int64
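The totals above are larger than any single film's revenue because `df_g` repeats each film once per genre. A quick arithmetic check, using only figures that appear earlier in this notebook (the revenues and genre lists from the `df.head(5)` output), confirms the relationship for two of the titles:

```python
# Revenue and number of listed genres, taken from the df.head(5) output.
films = {
    "Jurassic World": (1513528810, 4),                # Action|Adventure|Science Fiction|Thriller
    "Star Wars: The Force Awakens": (2068178225, 4),  # Action|Adventure|Science Fiction|Fantasy
}

# Totals reported by the groupby on df_g.
reported = {
    "Jurassic World": 6054115240,
    "Star Wars: The Force Awakens": 8272712900,
}

for title, (revenue, n_genres) in films.items():
    # Each reported total equals the film's revenue times its genre count.
    assert revenue * n_genres == reported[title]
    print(title, reported[title] // n_genres)
```

Since films differ in how many genres they list, the ranking may shift once each film is counted once, for example by grouping the film-level `df` instead of `df_g`.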

<a id="conclusions"></a>
## Conclusions

> Using the cell below, write a brief conclusion of what you have found from the analysis of the data. The cell below will allow you to write plain text instead of code.

From the analysis, the top two production companies are Relativity Media and Universal Pictures, with only one movie separating them (90 versus 89 releases since 2008). Based on total revenue, action was the highest-grossing genre, with adventure close behind. Steven Spielberg's movies grossed more than those of any other director, which suggests that his films consistently draw large audiences. The highest-grossing movie was Avatar, although it also took a great deal of time and money to produce.