# Top Earners in the Movie Industry

## Table of Contents

<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

> This analysis project uses the IMDb movie data. When the analysis is complete, you should be able to identify the top 5 highest-grossing directors and the top 5 highest-grossing movie genres of all time, compare the revenue of the highest-grossing movies, and see which companies released the most movies.

> There are 10 columns that will not be needed for the analysis. Use pandas to drop these columns. HINT: Only the columns pertaining to revenue will be needed.

> To get you started, I've already added the code that imports the packages and loads the data file you will use for the project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:

df = pd.read_csv('imdb-movies.csv')

### Drop columns without necessary information and remove all records with no financial information. Pay close attention to columns that tell you nothing about the financial data.

In [3]:
df.head(10)
df.count()

id                      10866
imdb_id                 10856
popularity              10866
budget                  10866
revenue                 10866
original_title          10866
cast                    10790
homepage                 2936
director                10822
tagline                  8042
keywords                 9373
overview                10862
runtime                 10866
genres                  10843
production_companies     9836
release_date            10866
vote_count              10866
vote_average            10866
release_year            10866
budget_adj              10866
revenue_adj             10866
dtype: int64

In [4]:
df.dtypes

id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

In [5]:
# display all columns and decide what is needed and what is not needed
# find top 5 highest grossing directors
# top 5 grossing movie genres of all time
# compare revenue of the highest grossing movies
# which companies released the most movies?

df.columns

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

In [6]:
# Run this cell only once; running it a second time raises a KeyError because the columns are already gone.
# To answer the questions we need the movie title, director, genres, production companies, budget, and revenue.
# release_year is kept as well because it is used later to select recent movies.
df = df.drop(['id', 'imdb_id', 'keywords', 'cast', 'homepage', 'overview', 'vote_count',
              'vote_average', 'tagline', 'runtime', 'budget_adj', 'revenue_adj'], axis=1)

In [7]:
df.duplicated().sum()

1
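
The count above reports one fully duplicated row. As a minimal sketch of how such a row could be removed, here is `drop_duplicates` applied to a small made-up frame (the `toy` data is hypothetical and is not part of the movie dataset):

```python
import pandas as pd

# Hypothetical stand-in for the movie data; the values are invented for illustration.
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film A"],
    "revenue": [100, 250, 100],
})

# duplicated() marks every row that is identical to an earlier row.
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first copy of each repeated row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```

The rest of this notebook keeps the duplicate, so applying the same call to `df` would shift the totals reported later by that one row.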

In [8]:
# display first 15
df[df.isna().any(axis=1)].head(15)

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,production_companies,release_date,release_year
228,0.584363,0,0,Racing Extinction,Louie Psihoyos,Adventure|Documentary,,1/24/15,2015
259,0.476341,0,0,Crown for Christmas,Alex Zamm,TV Movie,,11/27/15,2015
295,0.417191,0,0,12 Gifts of Christmas,Peter Sullivan,Family|TV Movie,,11/26/15,2015
298,0.370258,0,0,The Girl in the Photographs,Nick Simon,Crime|Horror|Thriller,,9/14/15,2015
328,0.367617,0,0,Advantageous,Jennifer Phang,Science Fiction|Drama|Family,,6/23/15,2015
370,0.314199,0,2334228,Meru,Jimmy Chin|Elizabeth Chai Vasarhelyi,Adventure|Documentary,,1/25/15,2015
374,0.302474,0,0,The Sisterhood of Night,Caryn Waechter,Mystery|Drama|Thriller,,4/10/15,2015
382,0.295946,0,0,Unexpected,Kris Swanberg,Drama|Comedy,,7/24/15,2015
388,0.289526,700000,0,Walter,Anna Mastro,Drama|Comedy,,3/13/15,2015
393,0.283194,2000000,0,Night Of The Living Deb,Kyle Rankin,Comedy|Horror,,8/29/15,2015


In [9]:
df.isna().sum()

popularity                 0
budget                     0
revenue                    0
original_title             0
director                  44
genres                    23
production_companies    1030
release_date               0
release_year               0
dtype: int64

### Data Cleaning

In [10]:
# Drop every record that has a null (missing) value in any column
df.dropna(inplace=True)
df.isna().sum()

popularity              0
budget                  0
revenue                 0
original_title          0
director                0
genres                  0
production_companies    0
release_date            0
release_year            0
dtype: int64
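
`dropna` only removes rows that contain NaN. Several of the rows displayed earlier show `0` for `budget` or `revenue`, and those rows survive this step. The sketch below assumes that a zero means "not reported", which the dataset itself does not confirm, and filters on that assumption using a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical stand-in for the movie data; the values are invented for illustration.
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [150, 0, 70, 0],
    "revenue": [500, 0, 0, 230],
})

# Assumption: a zero in either column means the figure was not reported.
has_financials = (toy["budget"] > 0) & (toy["revenue"] > 0)

# .copy() avoids chained-assignment warnings if columns are added later.
financial_df = toy[has_financials].copy()
print(financial_df)
```

The results further down were computed without this filter, so a movie with a zero budget contributes its full revenue to the profit totals.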

In [11]:
# display the first 10 rows of the cleaned data
df.head(10)

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,production_companies,release_date,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,2015
2,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2015
3,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,2015
4,9.335014,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2015
5,9.1107,135000000,532950503,The Revenant,Alejandro González Iñárritu,Western|Drama|Adventure|Thriller,Regency Enterprises|Appian Way|CatchPlay|Anony...,12/25/15,2015
6,8.654359,155000000,440603537,Terminator Genisys,Alan Taylor,Science Fiction|Action|Thriller|Adventure,Paramount Pictures|Skydance Productions,6/23/15,2015
7,7.6674,108000000,595380321,The Martian,Ridley Scott,Drama|Adventure|Science Fiction,Twentieth Century Fox Film Corporation|Scott F...,9/30/15,2015
8,7.404165,74000000,1156730962,Minions,Kyle Balda|Pierre Coffin,Family|Animation|Adventure|Comedy,Universal Pictures|Illumination Entertainment,6/17/15,2015
9,6.326804,175000000,853708609,Inside Out,Pete Docter,Comedy|Animation|Family,Walt Disney Pictures|Pixar Animation Studios|W...,6/9/15,2015


#### Here's a helpful hint from my own analysis when I ran this the first time. This may help shed light on what your data set should look like.

#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`,<br>and then tried to run calculations, it wouldn't work, because for many records the number of `production_companies`<br>and the number of `genres` aren't the same. So I'll create two dataframes: one without a `production_companies` column and one without a `genres` column.

In [12]:
# CREATE TWO DATAFRAMES
# one without genres
# one without production companies

In [13]:
# genre dataframe without production_companies column
genre_df = df.drop(['production_companies'],axis=1)
genre_df

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,release_date,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,6/9/15,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,5/13/15,2015
2,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,3/18/15,2015
3,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,12/15/15,2015
4,9.335014,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,4/1/15,2015
...,...,...,...,...,...,...,...,...
10861,0.080598,0,0,The Endless Summer,Bruce Brown,Documentary,6/15/66,1966
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,12/21/66,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery|Comedy,1/1/66,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action|Comedy,11/2/66,1966


In [14]:
# production dataframe without genre column
production_df=df.drop(['genres'],axis=1)
production_df

Unnamed: 0,popularity,budget,revenue,original_title,director,production_companies,release_date,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,2015
2,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2015
3,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,2015
4,9.335014,190000000,1506249360,Furious 7,James Wan,Universal Pictures|Original Film|Media Rights ...,4/1/15,2015
...,...,...,...,...,...,...,...,...
10861,0.080598,0,0,The Endless Summer,Bruce Brown,Bruce Brown Films,6/15/66,1966
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Cherokee Productions|Joel Productions|Douglas ...,12/21/66,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mosfilm,1/1/66,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Benedict Pictures Corp.,11/2/66,1966


In [15]:
# split a delimited column so that each separated value gets its own row;
# all other columns are copied unchanged into every new row
def splitDataFrame(df,target_column,separator):
    def splitToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pd.DataFrame(new_rows)
    return new_df
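
For comparison, the same reshaping can be sketched with pandas' built-in `str.split` and `DataFrame.explode`. The helper name `split_column_to_rows` and the `toy` frame are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def split_column_to_rows(frame, target_column, separator):
    """Return a copy of `frame` with one row per separated value in `target_column`."""
    out = frame.copy()
    # A single-character separator such as "|" is treated as a literal string.
    out[target_column] = out[target_column].str.split(separator)
    # explode() turns each list element into its own row and repeats the other columns.
    return out.explode(target_column).reset_index(drop=True)

# Hypothetical two-row sample in the same shape as genre_df.
toy = pd.DataFrame({
    "original_title": ["Jurassic World", "Furious 7"],
    "genres": ["Action|Adventure", "Action|Crime|Thriller"],
})
long_df = split_column_to_rows(toy, "genres", "|")
print(long_df)
```

This version avoids the row-by-row `apply` with a side-effecting accumulator, which tends to be slower on large frames. Note that `DataFrame.explode` requires pandas 0.25 or newer.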

In [16]:
genre_df=splitDataFrame(genre_df,'genres','|')
genre_df

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,release_date,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action,6/9/15,2015
1,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Adventure,6/9/15,2015
2,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Science Fiction,6/9/15,2015
3,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Thriller,6/9/15,2015
4,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action,5/13/15,2015
...,...,...,...,...,...,...,...,...
24712,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery,1/1/66,1966
24713,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Comedy,1/1/66,1966
24714,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action,11/2/66,1966
24715,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Comedy,11/2/66,1966


In [17]:
production_df=splitDataFrame(production_df,'production_companies','|')
production_df

Unnamed: 0,popularity,budget,revenue,original_title,director,production_companies,release_date,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Universal Studios,6/9/15,2015
1,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Amblin Entertainment,6/9/15,2015
2,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Legendary Pictures,6/9/15,2015
3,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Fuji Television Network,6/9/15,2015
4,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Dentsu,6/9/15,2015
...,...,...,...,...,...,...,...,...
23189,0.065543,0,0,Grand Prix,John Frankenheimer,Joel Productions,12/21/66,1966
23190,0.065543,0,0,Grand Prix,John Frankenheimer,Douglas & Lewis Productions,12/21/66,1966
23191,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mosfilm,1/1/66,1966
23192,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Benedict Pictures Corp.,11/2/66,1966


<a id="eda"></a>
## Exploratory Data Analysis

> Use Matplotlib to display your data analysis

In [None]:
# which companies released the most movies?
# top 5 grossing movie genres of all time
# find top 5 highest grossing directors
# compare revenue of the highest grossing movies


### Which production companies released the most movies in the last 10 years? Display the top 5 production companies.

In [19]:
# keep movies released after 2011, i.e. within 10 years of when this analysis was run
year=production_df.loc[production_df['release_year'] > 2011]

In [23]:
# display top 5 companies with the most movies
year['production_companies'].value_counts().nlargest(5)

Universal Pictures       51
Warner Bros.             46
Paramount Pictures       35
Columbia Pictures        35
Blumhouse Productions    33
Name: production_companies, dtype: int64
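
The cutoff year above is hard-coded relative to when the notebook was run. One alternative, sketched here under the assumption that "the last 10 years" should be measured from the most recent release year in the data, derives the cutoff from the data itself. The `toy` frame and the `window` variable are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for production_df; the values are invented for illustration.
toy = pd.DataFrame({
    "production_companies": ["Studio A", "Studio A", "Studio B", "Studio C"],
    "release_year": [2015, 2012, 2009, 2004],
})

window = 10  # number of years to look back, counting the latest year
latest_year = toy["release_year"].max()

# Keep releases from the `window` most recent years present in the data.
recent = toy[toy["release_year"] > latest_year - window]
top_companies = recent["production_companies"].value_counts().nlargest(5)
print(top_companies)
```

With this approach the selection does not change depending on the date the notebook is executed.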

### What 5 movie genres grossed the highest all-time?

In [39]:
genre_df.groupby("genres").revenue.sum().nlargest(5)

genres
Action       173418313979
Adventure    166317625752
Comedy       142141376544
Drama        138896772395
Thriller     121189561087
Name: revenue, dtype: int64
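
The assignment asks for Matplotlib output, so here is a minimal sketch of a bar chart for the totals printed above. The numbers are copied from that output; the figure size, axis labels, and title are illustrative choices:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Totals copied from the genre revenue output above.
top_genres = pd.Series(
    {
        "Action": 173418313979,
        "Adventure": 166317625752,
        "Comedy": 142141376544,
        "Drama": 138896772395,
        "Thriller": 121189561087,
    },
    name="revenue",
)

fig, ax = plt.subplots(figsize=(8, 4))
# Divide by 1e9 so the y-axis ticks read in billions instead of raw units.
ax.bar(top_genres.index, top_genres.values / 1e9)
ax.set_xlabel("Genre")
ax.set_ylabel("Total revenue (billions)")
ax.set_title("Top 5 genres by total revenue")
fig.tight_layout()
```

In the notebook itself, the Series returned by `genre_df.groupby("genres").revenue.sum().nlargest(5)` could be passed in place of the hard-coded values.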

In [43]:
genre_df['profit'] = genre_df.revenue - genre_df.budget
print("Profit")
genre_df.groupby("genres").profit.sum().nlargest(5)

Profit


genres
Adventure    111216306097
Action       107540425634
Comedy        91947521248
Drama         82656036667
Thriller      71402548810
Name: profit, dtype: int64

### Who are the top 5 grossing directors?

In [44]:
df.groupby(["director"]).revenue.sum().nlargest(5)

director
Steven Spielberg     9018563772
Peter Jackson        6523244659
James Cameron        5841894863
Michael Bay          4917208171
Christopher Nolan    4167548502
Name: revenue, dtype: int64

In [45]:
df['profit'] = df.revenue - df.budget
print("Profit")
df.groupby("director").profit.sum().nlargest(5)

Profit


director
Steven Spielberg    7428613772
Peter Jackson       5196468949
James Cameron       5081849077
Michael Bay         3557208171
David Yates         3379295625
Name: profit, dtype: int64

### Compare the revenue of the highest grossing movies of all time.

In [35]:
# display total revenue from revenue column
print("Revenue")
df.groupby("original_title").revenue.sum().nlargest(5)

Revenue


original_title
Avatar                          2781505847
Star Wars: The Force Awakens    2068178225
Titanic                         1845034188
The Avengers                    1568080742
Jurassic World                  1513528810
Name: revenue, dtype: int64
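
Grouping by `original_title` adds together the revenue of any films that happen to share a title, such as an original and its remake. Whether that affects the top 5 here has not been checked. The sketch below uses a made-up `toy` frame to show the difference between grouping by title and ranking rows directly with `DataFrame.nlargest`:

```python
import pandas as pd

# Hypothetical data: two different films share the title "Remake".
toy = pd.DataFrame({
    "original_title": ["Remake", "Remake", "Epic"],
    "release_year": [1960, 2010, 2009],
    "revenue": [300, 400, 600],
})

# Grouping by title sums the two "Remake" films into one entry.
by_title = toy.groupby("original_title")["revenue"].sum().nlargest(2)

# Ranking rows directly keeps each film separate and retains its release year.
by_film = toy.nlargest(2, "revenue")[["original_title", "release_year", "revenue"]]

print(by_title)
print(by_film)
```

Ranking rows also makes it easy to show extra columns, such as `release_year` or `director`, next to each film.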

In [38]:
# display profit, i.e. revenue minus the budget (the cost of making the movie)
df['profit'] = df.revenue - df.budget
print("Profit")
df.groupby("original_title").profit.sum().nlargest(5)

Profit


original_title
Avatar                          2544505847
Star Wars: The Force Awakens    1868178225
Titanic                         1632034188
Jurassic World                  1363528810
Furious 7                       1316249360
Name: profit, dtype: int64

<a id="conclusions"></a>
## Conclusions

> Using the cell below, write a brief conclusion of what you have found from the analysis of the data. The cell below will allow you to write plain text instead of code.

- Universal Pictures released the most movies in the last 10 years (release years after 2011), according to this dataset.
- Paramount Pictures and Columbia Pictures tied for third place in the number of movies released over the same period.
- Action has the highest gross revenue, but Adventure has the highest overall profit once budgets are subtracted.
- Steven Spielberg has the highest gross revenue and the highest profit of any director.
- Avatar is the highest-grossing movie of all time!
- When the highest-grossing movies are ranked by profit instead of gross revenue, Jurassic World moves ahead of The Avengers, and Furious 7 takes 5th place.
