# Phase 1 Project

![microsoft_image.png](attachment:microsoft_image.png)

## Overview

## Business Problem

Microsoft has decided to open a new branch of their company: Microsoft Movie Studios. In order to aid the new head of Microsoft Movie Studios in making decisions regarding what type of films to create, I explore and analyze data from two sources: IMDb and Box Office Mojo. In doing so, I look at the ratings and lengths of various films, genres, production studios, and gross earnings.

## Data Understanding

In [802]:
#Importing necessary packages
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [803]:
#Loading in data from imdb file
conn = sqlite3.connect('zippedData/im.db')

imdb_tables = """
SELECT name FROM sqlite_master WHERE type='table'
"""
pd.read_sql(imdb_tables, conn)

Unnamed: 0,name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings
5,persons
6,principals
7,writers


In [804]:
#Loading in data from the movie_gross file taken from Box Office Mojo
movie_gross_df = pd.read_csv('./zippedData/bom.movie_gross.csv.gz')

### IMDb Data

#### Movie_Basics Table

This table contains data from 2010 to 2022, along with a few entries dated 2023 to 2027 and 2115.
However, because 2023 and later years have yet to occur, I assume that any data after 2022 will likely not be relevant.

The data in this table includes information such as start year, genre, and runtime.
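
As a minimal sketch of the year filter described above, the snippet below keeps only rows whose start year is 2022 or earlier. The toy rows and the names `toy_basics` and `LAST_VALID_YEAR` are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for movie_basics (illustrative values, not from the dataset)
toy_basics = pd.DataFrame({
    "movie_id": ["a", "b", "c", "d"],
    "start_year": [2010, 2019, 2023, 2115],
})

LAST_VALID_YEAR = 2022  # assumption taken from the text above

# Keep only rows whose start_year is not later than the cutoff
released = toy_basics.loc[toy_basics["start_year"] <= LAST_VALID_YEAR]
released_ids = released["movie_id"].tolist()
```

The same boolean mask could be applied to `movie_basics_df` once it is loaded.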

In [805]:
#Loading the movie_basics table into a DataFrame (146,144 rows and 6 columns)
movie_basics_df = pd.read_sql("""SELECT * FROM movie_basics;""", conn)

In [806]:
#Looking at the general structure and content of the movie_basics table from the IMDb file
movie_basics_df.info()

#6 columns and 146,144 rows with missing values in 'original_title', 'runtime_minutes', and 'genres'

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [807]:
#Looking at the descriptive statistics for the numerical columns start_year and runtime_minutes
movie_basics_df.describe()

#We see here that in the runtime_minutes column, both the min and max are outliers 
#with values of 1 and 51,420, respectively.

#Furthermore, when disregarding the outliers, the range of movie runtimes is quite small 
#with the 25th percentile being 70 minutes and the 75th percentile being 99 minutes.

#We also see that the max in the start_year column is an outlier as it is 2115,
#which suggests that we should analyze the start_year column in more depth, if needed.

#This table will likely provide me with a large portion of the data that will be used in my data analysis.

Unnamed: 0,start_year,runtime_minutes
count,146144.0,114405.0
mean,2014.6218,86.18725
std,2.73358,166.36059
min,2010.0,1.0
25%,2012.0,70.0
50%,2015.0,87.0
75%,2017.0,99.0
max,2115.0,51420.0
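
To make the outlier claim concrete, here is a small worked calculation using the quartiles from the output above. It applies the common 1.5 x IQR rule of thumb, which is an assumption on my part and only one of several conventions for flagging outliers.

```python
# Quartiles copied from the describe() output above
q1, q3 = 70.0, 99.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged


def is_outlier(runtime_minutes):
    """Return True when a runtime falls outside the 1.5 x IQR fences."""
    return runtime_minutes < lower_fence or runtime_minutes > upper_fence
```

Under this rule the fences are 26.5 and 142.5 minutes, so both the minimum (1) and the maximum (51,420) are flagged, while the median (87) is not.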


In [808]:
#Looking at how many movies were released in each year
pd.read_sql("""
    SELECT COUNT(*) AS num_movies_per_year, start_year
    FROM movie_basics
    GROUP BY start_year;""", conn)

#This DataFrame shows that some movies that have not been released yet (2023 and onward) are included in this dataset.

#Therefore, in our data preparation, we will create a new table that includes 
#only the movies that have already been released.

Unnamed: 0,num_movies_per_year,start_year
0,11849,2010
1,12900,2011
2,13787,2012
3,14709,2013
4,15589,2014
5,16243,2015
6,17272,2016
7,17504,2017
8,16849,2018
9,8379,2019


In [809]:
#Looking at the average movie runtime in minutes per year
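movie_basics_df.groupby('start_year')[['runtime_minutes']].mean()

#This DataFrame shows the mean runtime in minutes per year.

#As we can see, the mean runtime does not rise steadily: it stays between roughly 84.5 and 89.2 minutes
#from 2010 to 2018 and peaks at about 90.9 minutes in 2019.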

#Furthermore, this DataFrame confirms that we should remove the years 2023 - 2115 when we clean our data.

Unnamed: 0_level_0,runtime_minutes
start_year,Unnamed: 1_level_1
2010,85.49569
2011,86.41011
2012,89.20886
2013,84.93167
2014,84.5415
2015,85.40711
2016,84.97425
2017,85.73221
2018,87.6611
2019,90.88736


In [810]:
#Looking at how many movies fall under specific genre categories (ex: how many movies are from the Drama genre)
#pd.read_sql("""
    #SELECT COUNT(*) AS num_movies_per_genre, genres
    #FROM movie_basics
    #GROUP BY genres
    #ORDER BY num_movies_per_genre DESC;""", conn).head(5)

#As we can see from this DataFrame, the top five genres in this dataset are Documentary, Drama, Comedy, None, and Horror

#Because 5408 entries lack a genre specification, we need to assess whether these entries will provide
#us with insightful information or whether we should remove them.
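
The genre count above can also be sketched in pandas. This is a minimal example on toy data, assuming a Series of genre strings like the `genres` column; the name `toy_genres` is hypothetical.

```python
import pandas as pd

# Toy genre values, including missing entries (illustrative only)
toy_genres = pd.Series(["Drama", "Documentary", None, "Drama", "Comedy", None, "Drama"])

# dropna=False keeps missing genres as their own bucket,
# similar to the None row in the SQL result
genre_counts = toy_genres.value_counts(dropna=False)

# Number of entries with no genre at all
n_missing = int(toy_genres.isna().sum())
```

Keeping the missing bucket visible makes it easier to judge how much data would be lost by dropping those rows.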

#### Movie_Ratings Table

This table includes information about the average rating and number of ratings for the movies in this dataset.

In [811]:
#Exploring the movie_ratings table from the IMDb data
movie_ratings_df = pd.read_sql("""SELECT * FROM movie_ratings;""", conn)
movie_ratings_df.info()

#3 columns and 73,856 rows with no missing data in any column

#Based on the column names, this table will likely be very useful in analyzing audience opinions on movies.

#Joining the movie_basics table and the movie_ratings table will provide us with 
#a greater understanding of the general information of the movies in the dataset.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [812]:
#Deactivating scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)

#Looking at the descriptive statistics for the numerical columns averagerating and numvotes
movie_ratings_df.describe()

#From this DataFrame, we can see that the column numvotes has an outlier of 1,841,066 as its max.

Unnamed: 0,averagerating,numvotes
count,73856.0,73856.0
mean,6.33273,3523.66217
std,1.47498,30294.02297
min,1.0,5.0
25%,5.5,14.0
50%,6.5,49.0
75%,7.4,282.0
max,10.0,1841066.0


In [813]:
#Looking at how many ratings there are of each average rating (ex: how many movies were given a 7.0 rating)
#pd.read_sql("""
    #SELECT COUNT(averagerating) AS num_of_avgratings, averagerating
    #FROM movie_ratings
    #GROUP BY averagerating
    #ORDER BY COUNT(averagerating) DESC;""", conn)

#This data shows us that the most common rating is a 7.0 and the least common rating is a 9.9.

#### Directors Table

This table contains information regarding the directors of the movies in this dataset.

In [814]:
#Exploring the directors table from the IMDb data
directors_df = pd.read_sql("""SELECT * FROM directors;""", conn)
directors_df.info()

#2 columns (movie_id and person_id) with 291174 rows and no missing values

#This data will be useful when looking at the associations between the success of a movie and who directed it.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291174 entries, 0 to 291173
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   movie_id   291174 non-null  object
 1   person_id  291174 non-null  object
dtypes: object(2)
memory usage: 4.4+ MB


#### Known_For Table

In [815]:
#Exploring the known_for table from the IMDb data
#known_for_df = pd.read_sql("""SELECT * FROM known_for;""", conn)
#known_for_df.info()

#2 columns (movie_id and people_id) with 1638260 rows and no missing values

#### Principals Table

This table contains information regarding the cast and crew of the movies in this dataset.

In [816]:
#Exploring the principals table from the IMDb data
principals_df = pd.read_sql("""SELECT * FROM principals;""", conn)
principals_df.info()

#6 columns with 1028186 rows and missing values in 'job' and 'characters'

#Like the directors table, this data will be useful when looking at the associations 
#between the success of a movie and which actors and actresses were in it.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   movie_id    1028186 non-null  object
 1   ordering    1028186 non-null  int64 
 2   person_id   1028186 non-null  object
 3   category    1028186 non-null  object
 4   job         177684 non-null   object
 5   characters  393360 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


#### Persons Table

This table contains information that can build upon the directors and principals tables such as primary name, birth year, and death year.

In [817]:
#Exploring the persons table from the IMDb data
persons_df = pd.read_sql("""SELECT * FROM persons;""", conn)
persons_df.info()

#5 columns with 606648 rows and missing values in 'birth_year', 'death_year', and 'primary_profession'

#This table will be useful when joined with the directors table and the 
#principals table (separately) because it will match the person_id from 
#each of the respective tables and provide me with the associated primary_name.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   person_id           606648 non-null  object 
 1   primary_name        606648 non-null  object 
 2   birth_year          82736 non-null   float64
 3   death_year          6783 non-null    float64
 4   primary_profession  555308 non-null  object 
dtypes: float64(2), object(3)
memory usage: 23.1+ MB


### Movie Gross Data

This dataset includes information about numerous movies such as title, studio, year, and both domestic and foreign gross. 

In [818]:
#Looking at the general information and structure of the movie_gross data
movie_gross_df.info()

#From this, we see that the movie_gross dataset has 3387 rows and 5 columns, 
#and the columns with missing data are studio, domestic_gross, and foreign_gross

#It is interesting that, unlike domestic_gross, foreign_gross has the dtype object, 
#meaning that either all of the values are strings or there is a mix of datatypes

#We will want to convert all the data from foreign_gross to floats

#This data will help me understand the success of different types of movies from a business standpoint

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [819]:
#After looking more thoroughly at the data in foreign_gross, 
#we can see that some values are strings and other values are floats

#Before converting the values in foreign_gross to floats, I drop every row that has a NaN in ANY column,
#rather than only the rows that are missing foreign_gross
#Doing so keeps the remaining rows complete and consistent across all columns

movie_gross_df.dropna(inplace = True)

movie_gross_df.info()
    

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2007 entries, 0 to 3353
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           2007 non-null   object 
 1   studio          2007 non-null   object 
 2   domestic_gross  2007 non-null   float64
 3   foreign_gross   2007 non-null   object 
 4   year            2007 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 94.1+ KB


In [820]:
#Converting all values to strings for uniformity, removing any punctuation in the strings (commas),
#and then changing the datatype of all values to float
movie_gross_df['foreign_gross'] = (movie_gross_df['foreign_gross']
                                   .astype(str)
                                   .str.replace(',', '', regex=False)
                                   .astype(float))

movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2007 entries, 0 to 3353
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           2007 non-null   object 
 1   studio          2007 non-null   object 
 2   domestic_gross  2007 non-null   float64
 3   foreign_gross   2007 non-null   float64
 4   year            2007 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 94.1+ KB
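
As an alternative sketch of the same conversion, `pd.to_numeric` with `errors='coerce'` turns any value it cannot parse into NaN instead of raising an error. The toy values below are illustrative and are not taken from the Box Office Mojo file.

```python
import pandas as pd

# Illustrative values mixing plain numbers and comma-formatted strings
toy_gross = pd.Series(["652000000", "1,131.6", 600.0])

cleaned = pd.to_numeric(
    toy_gross.astype(str).str.replace(",", "", regex=False),
    errors="coerce",  # unparseable values become NaN instead of raising
)
```

This variant is tolerant of stray non-numeric strings, at the cost of silently producing NaN for them, so it is worth checking `cleaned.isna().sum()` afterwards.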


In [821]:
#Looking at the descriptive statistics of the numerical columns domestic gross, foreign gross, and year
movie_gross_df.describe()

Unnamed: 0,domestic_gross,foreign_gross,year
count,2007.0,2007.0,2007.0
mean,47019840.20179,75790384.84131,2013.50623
std,81626889.32324,138179552.62752,2.598
min,400.0,600.0,2010.0
25%,670000.0,3900000.0,2011.0
50%,16700000.0,19400000.0,2013.0
75%,56050000.0,75950000.0,2016.0
max,936700000.0,960500000.0,2018.0


## Data Preparation

### Creating and Cleaning the Joined_MB_MR DataFrame

I create and clean a DataFrame called Joined_MB_MR which joins the movie_basics table and the movie_ratings table.

In [855]:
#Joining the movie_basics and movie_ratings tables to access more data about the movies
joined_mb_mr = pd.read_sql("""
    SELECT * 
    FROM movie_basics as mb
    JOIN movie_ratings as mr
        USING (movie_id)
    ;""", conn)

In [823]:
#Exploring the new DataFrame
joined_mb_mr.info()


#After the join, we have 73,856 rows and 8 columns

#The only columns with any missing values are 'runtime_minutes' and 'genres'

#The number of missing values in both of these columns has decreased significantly after the join

#The missing values in 'genres' dropped from 5,408 out of 146,144 in movie_basics 
#to 804 out of 73,856 in this new DataFrame

#The missing values in 'runtime_minutes' dropped from 31,739 out of 146,144 in movie_basics
#to 7,620 out of 73,856 in this new DataFrame

#IMPORTANT: This data only shows movies from 2010 until 2019

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   averagerating    73856 non-null  float64
 7   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB
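
The before-and-after comparison of missing values can be sketched in pandas as follows. The toy tables and the names `toy_mb`, `toy_mr`, and `toy_joined` are assumptions for illustration; the inner merge mirrors the `JOIN ... USING (movie_id)` query above.

```python
import pandas as pd

# Toy stand-ins for movie_basics and movie_ratings (illustrative values)
toy_mb = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3"],
    "runtime_minutes": [90.0, None, 120.0],
    "genres": ["Drama", None, "Action,Crime"],
})
toy_mr = pd.DataFrame({
    "movie_id": ["m1", "m3"],
    "averagerating": [7.0, 6.5],
})

# Inner join on movie_id: only movies that have a rating survive
toy_joined = toy_mb.merge(toy_mr, on="movie_id", how="inner")

# Count missing values per column before and after the join
missing_before = toy_mb[["runtime_minutes", "genres"]].isna().sum()
missing_after = toy_joined[["runtime_minutes", "genres"]].isna().sum()
```

In this toy case the unrated movie is also the one with missing fields, which is the same pattern the real join shows on a larger scale.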


In [824]:
#Taking out missing data from joined_mb_mr dataset

#As mentioned above, there are two columns with missing values: 'runtime_minutes' and 'genres'.

#Because it is difficult to replace genres and replacing missing runtime_minutes values with the median 
#would likely create an inaccurate image of the data, I will drop all rows with missing values

joined_mb_mr = joined_mb_mr.dropna()
joined_mb_mr

#Now there are 65720 rows and 8 columns

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.00000,"Action,Crime,Drama",7.00000,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00000,"Biography,Drama",7.20000,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00000,Drama,6.90000,4517
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00000,"Comedy,Drama,Fantasy",6.50000,119
6,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.00000,"Adventure,Animation,Comedy",8.10000,263
...,...,...,...,...,...,...,...,...
73849,tt9911774,Padmavyuhathile Abhimanyu,Padmavyuhathile Abhimanyu,2019,130.00000,Drama,8.40000,365
73850,tt9913056,Swarm Season,Swarm Season,2019,86.00000,Documentary,6.20000,5
73851,tt9913084,Diabolik sono io,Diabolik sono io,2019,75.00000,Documentary,6.20000,6
73852,tt9914286,Sokagin Çocuklari,Sokagin Çocuklari,2019,98.00000,"Drama,Family",8.70000,136


In [825]:
#Making genres column into list of multiple genres, so I can access the first genre listed.

joined_mb_mr['genres'] = joined_mb_mr['genres'].apply(lambda x: x.split(','))

In [826]:
#Creating a new column called genres1 that shows the first genre of each movie.
#I do this so that I can categorize each movie by a single genre.

joined_mb_mr['genres1'] = joined_mb_mr['genres'].apply(lambda x: x[0])
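
The two steps above can also be written with the pandas `.str` accessor. This is a small sketch on toy strings; `toy_genre_strings` and `first_genre` are hypothetical names.

```python
import pandas as pd

# Comma-separated genre strings in the same shape as the genres column
toy_genre_strings = pd.Series(["Action,Crime,Drama", "Documentary", "Comedy,Drama,Fantasy"])

# .str.split returns a list per row; .str[0] picks the first element of each list
first_genre = toy_genre_strings.str.split(",").str[0]
```

Note that taking the first listed genre is a simplification: a movie tagged "Action,Crime,Drama" is counted only as Action.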

### Preparing the Movie Gross DataFrame

I prepare the movie_gross_df DataFrame by creating a new column called total_revenue, which is the sum of domestic_gross and foreign_gross, so I can analyze the success of a movie based on total earnings.

In [827]:
#Adding a total revenue column which is the domestic gross and foreign gross combined
movie_gross_df['total_revenue'] =  movie_gross_df['domestic_gross'] + movie_gross_df['foreign_gross']

### Preparing the DataFrame for Insight 1a: Average Rating Per Genre

I create a new DataFrame called avg_ratings_per_genre_filtered to analyze the average rating per genre in the years 2010-2019.

In [107]:
#2010 - 2019
#Creating a new DataFrame that shows the average rating per genre
avg_ratings_per_genre = joined_mb_mr.groupby('genres1')['averagerating'].mean().sort_values(ascending = False)
avg_ratings_per_genre = pd.DataFrame(avg_ratings_per_genre)

In [108]:
#Adding the number of movies per genre
num_movies_per_genre = joined_mb_mr.groupby('genres1')['movie_id'].count()
avg_ratings_per_genre['num_movies_per_genre'] = num_movies_per_genre

In [109]:
#Adding the number of ratings per genre
avg_ratings_per_genre['numvotes'] = joined_mb_mr.groupby('genres1')['numvotes'].sum()

In [110]:
#Limiting the results to only genres with more than 1000 movies
avg_ratings_per_genre_filtered = avg_ratings_per_genre.loc[avg_ratings_per_genre['num_movies_per_genre'] > 1000]\
.sort_values(by = 'averagerating',ascending = False)

In [114]:
#Cleaning up column names
avg_ratings_per_genre_filtered = avg_ratings_per_genre_filtered.rename(columns = {'averagerating': 'Average_Rating',
                                                                                  'numvotes': 'Num_Ratings',
                                                                     'num_movies_per_genre': 'Num_Movies_Per_Genre'})


avg_ratings_per_genre_filtered = avg_ratings_per_genre_filtered.rename_axis('Genre')
#62135 ENTRIES
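
For comparison, the mean rating, movie count, and vote total per genre can be computed in a single pass with named aggregation. This is a sketch on toy data; `toy_movies`, `summary`, and the threshold `MIN_MOVIES` are assumptions that stand in for the real DataFrame and the cutoff of 1000 movies.

```python
import pandas as pd

# Toy stand-in for joined_mb_mr (illustrative values)
toy_movies = pd.DataFrame({
    "genres1": ["Drama", "Drama", "Action", "Action", "Action"],
    "movie_id": ["m1", "m2", "m3", "m4", "m5"],
    "averagerating": [7.0, 6.0, 5.0, 6.0, 7.0],
    "numvotes": [10, 20, 30, 40, 50],
})

MIN_MOVIES = 2  # stand-in for the notebook's threshold of 1000

# One groupby producing all three summary columns with their final names
summary = (
    toy_movies.groupby("genres1")
    .agg(
        Average_Rating=("averagerating", "mean"),
        Num_Movies_Per_Genre=("movie_id", "count"),
        Num_Ratings=("numvotes", "sum"),
    )
    .rename_axis("Genre")
)

# Keep only genres with more movies than the threshold, best rated first
summary = summary.loc[summary["Num_Movies_Per_Genre"] > MIN_MOVIES]
summary = summary.sort_values(by="Average_Rating", ascending=False)
```

Named aggregation avoids the separate rename step because each output column is named where it is defined.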

### Preparing the DataFrame for Insight 1b: Average Total Revenue Per Genre

I create a new DataFrame called avg_tot_revenue_per_genre_filtered to analyze the average total revenue per genre in the years 2010-2018.

In [117]:
#2010 - 2018
#Creating a new DataFrame called avg_tot_revenue_per_genre 
#by merging the joined_mb_mr DataFrame and movie_gross_df DataFrame
avg_tot_revenue_per_genre = pd.merge(joined_mb_mr, movie_gross_df, left_on = 'primary_title', right_on = 'title')

#Dropping the duplicate titles
avg_tot_revenue_per_genre = avg_tot_revenue_per_genre.drop_duplicates('primary_title')

#Dropping the redundant duplicate title column
avg_tot_revenue_per_genre = avg_tot_revenue_per_genre.drop(labels = 'title', axis = 1)

#Dropping start_year and keeping the year column
avg_tot_revenue_per_genre = avg_tot_revenue_per_genre.drop(labels = 'start_year', axis = 1)

In [841]:
#Creating a new filtered DataFrame that only includes the averages of
#total revenue, domestic gross, and foreign gross grouped by genre
avg_tot_revenue_per_genre_filtered = pd.DataFrame(avg_tot_revenue_per_genre.groupby
                                                  ('genres1')[['total_revenue', 'domestic_gross', 'foreign_gross']]
                                                  .mean())

#Sorting the DataFrame by total revenue in descending order
avg_tot_revenue_per_genre_filtered = avg_tot_revenue_per_genre_filtered\
                                    .sort_values(by = 'total_revenue', ascending = False)

#Cleaning up column names
avg_tot_revenue_per_genre_filtered = avg_tot_revenue_per_genre_filtered.rename\
                                            (columns = {'total_revenue': 'Average_Total_Revenue',
                                                        'domestic_gross': 'Domestic_Gross',
                                                        'foreign_gross': 'Foreign_Gross'})


avg_tot_revenue_per_genre_filtered = avg_tot_revenue_per_genre_filtered.rename_axis('Genre')
#1541 ENTRIES

### Preparing the Series for Insight 2: Runtime Per Rating Score

I create three new Series using data from 2010 - 2019: runtime_per_rating_all_genres, runtime_per_rating_action, and runtime_per_rating_adventure. They show the average runtime across three rating categories for all genres combined and for the two genres that tend to produce the most total revenue (Action and Adventure).

To do so, I categorize movie ratings as Good, Average, or Bad.

* Good = A rating equal to or greater than 7                                                                           
* Average = A rating less than 7 and greater than or equal to 5                                                       
* Bad = A rating less than 5
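
The three categories above can be expressed as half-open bins. The sketch below uses `pd.cut` on toy ratings; this is one possible way to assign the labels and the names `toy_ratings` and `rating_category` are hypothetical.

```python
import numpy as np
import pandas as pd

# Ratings chosen to sit on and around the category boundaries
toy_ratings = pd.Series([4.9, 5.0, 6.9, 7.0, 10.0])

# right=False makes each bin include its left edge: [-inf, 5), [5, 7), [7, inf)
rating_category = pd.cut(
    toy_ratings,
    bins=[-np.inf, 5, 7, np.inf],
    labels=["Bad", "Average", "Good"],
    right=False,
)
```

With `right=False`, a rating of exactly 5 lands in Average and a rating of exactly 7 lands in Good, matching the definitions in the list.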

In [831]:
#Creating a new DataFrame called runtime_per_rating as an independent copy of joined_mb_mr,
#so that adding the 'Rating' column does not modify joined_mb_mr
runtime_per_rating = joined_mb_mr.copy()

In [835]:
#Creating 'Rating' categories on a Good, Average, or Bad scale

#If the averagerating of a movie is greater than or equal to 7, it is considered 'Good'
runtime_per_rating.loc[runtime_per_rating['averagerating'] >= 7, 'Rating'] = 'Good'

#If the averagerating of a movie is less than 7 and greater than or equal to 5, it is considered 'Average'
runtime_per_rating.loc[(runtime_per_rating['averagerating'] < 7) & \
                        (runtime_per_rating['averagerating'] >= 5), 'Rating'] = 'Average'

#If the averagerating of a movie is less than 5, it is considered 'Bad'
runtime_per_rating.loc[runtime_per_rating['averagerating']  < 5, 'Rating'] = 'Bad'

In [836]:
#Limiting the entries to just those whose runtime is more than 1 hour and at most 4 hours
runtime_per_rating_filtered = runtime_per_rating.loc[(runtime_per_rating['runtime_minutes'] <= 240) \
                                                     & (runtime_per_rating['runtime_minutes'] > 60)]\
                                                        .sort_values(by = 'runtime_minutes',ascending = False)

In [839]:
#Looking at the mean runtime of all genres of movies across the different ratings categories
runtime_per_rating_all_genres = runtime_per_rating_filtered.groupby('Rating')['runtime_minutes'].mean()\
                                                                            .sort_values(ascending = False)
runtime_per_rating_all_genres

#61244 ENTRIES

Rating
Average   97.30589
Good      96.49493
Bad       93.83044
Name: runtime_minutes, dtype: float64

In [563]:
#Looking at the mean runtime for Action movies across the different ratings categories
runtime_per_rating_action = runtime_per_rating_filtered.loc[runtime_per_rating_filtered['genres1'] == 'Action']\
                                                                    .groupby('Rating')['runtime_minutes'].mean()
runtime_per_rating_action

#6055 ENTRIES

Rating
Average   107.87626
Bad        99.49795
Good      108.90824
Name: runtime_minutes, dtype: float64

In [186]:
#Looking at the mean runtime for Adventure movies across the different ratings categories
runtime_per_rating_adventure = runtime_per_rating_filtered.loc[runtime_per_rating_filtered['genres1'] == 'Adventure']\
                                                                        .groupby('Rating')['runtime_minutes'].mean()
runtime_per_rating_adventure

#2855 ENTRIES

Rating
Average   93.67950
Bad       91.88011
Good      93.90762
Name: runtime_minutes, dtype: float64

In [569]:
#to_concat = [runtime_per_rating_all_genres, runtime_per_rating_action, runtime_per_rating_adventure]
#all_3 = pd.concat(to_concat)
#all_3

### Preparing the DataFrames for Insight 3: Directors of the Most Successful Movies

I create two new DataFrames: most_successful_avg_rev and most_successful_single_rev to analyze which directors directed the most successful movies in 2010 - 2018 from a monetary standpoint.

I look at both the top 15 directors whose movies produced the highest mean revenue as well as the directors of the top 15 movies with the highest revenue.
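
The two rankings and their overlap can be sketched on toy data as follows. The director names, titles, and revenues are hypothetical, and `TOP_N` stands in for the top 15 used in this notebook.

```python
import pandas as pd

# Hypothetical directors and revenues, for illustration only
toy = pd.DataFrame({
    "primary_name": ["Director A", "Director A", "Director B", "Director C"],
    "primary_title": ["Film 1", "Film 2", "Film 3", "Film 4"],
    "total_revenue": [900.0, 100.0, 700.0, 50.0],
})

TOP_N = 2  # the notebook uses 15

# Ranking 1: directors ordered by the mean revenue of their movies
top_by_mean = (
    toy.groupby("primary_name")["total_revenue"].mean()
    .sort_values(ascending=False)
    .head(TOP_N)
)

# Ranking 2: the individual movies with the highest revenue
top_by_single = toy.sort_values(by="total_revenue", ascending=False).head(TOP_N)

# Directors who appear in both rankings, in mean-revenue order
single_names = set(top_by_single["primary_name"])
in_both = [name for name in top_by_mean.index if name in single_names]
```

The two rankings answer different questions: a director with one very large hit and one small film ranks high by single movie but lower by mean.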

In [777]:
#Creating new dataset that merges directors_df and persons_df from the imdb data
#which will eventually merge with avg_tot_revenue_per_genre

#Merging directors_df with persons_df
director_info = pd.merge(directors_df, persons_df, left_on = 'person_id', right_on = 'person_id')

#Dropping the columns birth_year and death_year because they are not relevant to our analysis
director_info = director_info.drop('birth_year', axis = 1)
director_info = director_info.drop('death_year', axis = 1)

#Merging director_info with avg_tot_revenue_per_genre
most_successful = pd.merge(director_info, avg_tot_revenue_per_genre,\
                           left_on = 'movie_id', right_on = 'movie_id')

#Dropping duplicate movie ids (this keeps only the first row for each movie)
most_successful = most_successful.drop_duplicates('movie_id')

#Dropping more columns that are not relevant to our analysis
most_successful = most_successful.drop(columns = ['original_title', 'genres', 'numvotes'])

#1541 ENTRIES

In [843]:
#Making a DataFrame of the directors with the highest mean revenue in 2010-2018
most_successful_avg_rev = most_successful.groupby('primary_name')['total_revenue']\
                                            .mean().sort_values(ascending = False).head(15)

most_successful_avg_rev = most_successful_avg_rev.to_frame()
most_successful_avg_rev = most_successful_avg_rev.reset_index()

#Making a DataFrame of the directors with the highest revenue from a single movie in 2010-2018
most_successful_single_rev = most_successful[['primary_name','primary_title','year','total_revenue',]]\
                            .sort_values(by = 'total_revenue', ascending = False).head(15)

#Creating a list of directors that appear in both of the above DataFrames
best_directors = [x for x in most_successful_avg_rev['primary_name']
                  if x in most_successful_single_rev['primary_name'].values]

### Preparing the DataFrames for Insight 4: Actors and Actresses from the Most Successful Movies

I create two new DataFrames: best_actors_avg_rev and best_actors_single_rev to analyze which actors/actresses were in the most successful movies in 2010 - 2018 from a monetary standpoint.

Like I do with the directors, I look at both the top 15 actors/actresses whose movies produced the highest mean revenue as well as the actors/actresses of the top 15 movies with the highest revenue.

In [857]:
#Creating new dataset that merges principals_df and persons_df from the imdb data
#which will eventually merge with avg_tot_revenue_per_genre
actors = pd.merge(principals_df, persons_df, left_on = 'person_id', right_on = 'person_id')
actors = pd.merge(actors, avg_tot_revenue_per_genre,\
                           left_on = 'movie_id', right_on = 'movie_id')

#Dropping duplicate movie ids (this keeps only the first row for each movie)
actors = actors.drop_duplicates('movie_id')

#Dropping columns that are not relevant to our analysis
actors = actors.drop(columns = ['original_title', 'genres', 'numvotes', 'job', 'ordering', 
                                'birth_year', 'death_year', 'primary_profession', 'runtime_minutes'])

#Editing the DataFrame so that it only shows entries for people who are actors/actresses
actors = actors.loc[(actors['category'] == 'actor')|(actors['category'] == 'actress')]

In [845]:
#Making a DataFrame of the actors/actresses from movies with the highest mean revenue in 2010-2018
best_actors_avg_rev = actors.groupby('primary_name')['total_revenue']\
                                            .mean().sort_values(ascending = False).head(15)
best_actors_avg_rev = best_actors_avg_rev.to_frame()
best_actors_avg_rev = best_actors_avg_rev.reset_index()

#Making a DataFrame of the actors/actresses from movies with the highest revenue from a single movie in 2010-2018
best_actors_single_rev = actors[['primary_name','primary_title','year','total_revenue',]]\
                            .sort_values(by = 'total_revenue', ascending = False).head(15)

#Creating a list of actors/actresses that appear in both of the above DataFrames
best_actors = [x for x in best_actors_avg_rev['primary_name']
               if x in best_actors_single_rev['primary_name'].values]
best_actors

['Daisy Ridley',
 'Rafe Spall',
 'Jason Momoa',
 'Alan Tudyk',
 "Ed O'Neill",
 'Sandra Bullock',
 'Javier Bardem',
 'Chris Evans']

## Data Analysis

### Insight 1a: Average Rating Per Genre

In [846]:
#Creating a bar chart with genres as the x-axis values and the average rating as the y-axis values
#I added a color scale to show the number of ratings that each genre had 
#I also added text to the graph to show the values more clearly

fig = px.bar(avg_ratings_per_genre_filtered, x=avg_ratings_per_genre_filtered.index, y='Average_Rating',
            hover_data=['Num_Ratings'], text = 'Average_Rating', color='Num_Ratings', labels={
                     "Average_Rating": "Average Rating on a 1-10 Scale",
                     "Genre": "Genre",
                     "Num_Ratings": "Number of Ratings"
                 },
                title="Average Rating Per Genre")

#Implementing the text we added above
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_layout(uniformtext_minsize=7, uniformtext_mode='hide')
fig.update_yaxes(range=[0, 8])
fig.show()

### Insight 1b: Total Revenue Per Genre

In [847]:
#Creating a bar chart with genres as the x-axis values and the average total revenue as the y-axis values
#I added text to the graph to show the values more clearly
fig = px.bar(avg_tot_revenue_per_genre_filtered, 
            x=avg_tot_revenue_per_genre_filtered.index,
            y='Average_Total_Revenue',
            text = 'Average_Total_Revenue',
            labels={"Average_Total_Revenue": "Average Total Revenue",
                    "Genre": "Genre"},
            title="Average Total Revenue Per Genre")

#Implementing the text we added above
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_layout(uniformtext_minsize=7, uniformtext_mode='hide')
fig.update_yaxes(range=[0, 300000000])
fig.show()

### Insight 2: Runtime Per Rating Score

In [848]:
#Making the subplots with their respective titles
fig = make_subplots(rows=1, cols=3, subplot_titles= ['All Genres', 'Action', 'Adventure'], \
                                                horizontal_spacing = 0.15)

#Adding the appropriate x and y values to the first subplot,
#assigning each x-value a corresponding color, and adding text
fig.add_trace(
    go.Bar(
        x=runtime_per_rating_all_genres.index,
        y=runtime_per_rating_all_genres.values,
        hoverlabel=dict(namelength=0),
        marker=dict(color=['gold', 'crimson', 'forestgreen']),
        text = runtime_per_rating_all_genres.values
   ),
    row=1, col=1
)

#Adding the appropriate x and y values to the second subplot, 
#assigning each x-value a corresponding color, and adding text

fig.add_trace(
    go.Bar(
        x=runtime_per_rating_action.index,
        y=runtime_per_rating_action.values,
        hoverlabel=dict(namelength=0),
        marker=dict(color=['gold', 'crimson', 'forestgreen']),
        text = runtime_per_rating_action.values
    ),
    row=1, col=2
)

#Adding the appropriate x and y values to the third subplot,
#assigning each x-value a corresponding color, and adding text
fig.add_trace(
    go.Bar(
        x=runtime_per_rating_adventure.index,
        y=runtime_per_rating_adventure.values,
        hoverlabel=dict(namelength=0),
        marker=dict(color=['gold', 'crimson', 'forestgreen']),
        text = runtime_per_rating_adventure.values
    ),
    row=1, col=3,
)

#Ordering the x-values for each subplot
fig.update_layout(xaxis1={'categoryorder':'array', 'categoryarray':['Good','Average','Bad']},
                  xaxis2={'categoryorder':'array', 'categoryarray':['Good','Average','Bad']},
                  xaxis3={'categoryorder':'array', 'categoryarray':['Good','Average','Bad']},
                  title_text="Mean Runtime Per Rating Category",
                  showlegend = False
                  )

#Creating the x axis titles for each subplot
fig.update_xaxes(title_text="Rating Categories", row=1, col=1)
fig.update_xaxes(title_text="Rating Categories", row=1, col=2)
fig.update_xaxes(title_text="Rating Categories", row=1, col=3)

#Creating the y axis titles for each subplot and standardizing the range of values
fig.update_yaxes(title_text="Runtime in Minutes", row=1, col=1, range=[0, 120])
fig.update_yaxes(title_text="Runtime in Minutes", row=1, col=2, range=[0, 120])
fig.update_yaxes(title_text="Runtime in Minutes", row=1, col=3, range=[0, 120])

#Formatting the bar labels in each subplot to 4 significant digits and placing them outside the bars
fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_layout(uniformtext_minsize=7, uniformtext_mode='hide')

fig.show()
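
Because the color lists above are matched to the bars by position, the pairing between a rating category and its color depends on the order of the Series index. The sketch below shows one way to make that pairing explicit with a dictionary. The runtime values are made up for illustration, and the category-to-color pairing is my assumption (it is what the positional lists would produce if the index is sorted alphabetically as Average, Bad, Good).

```python
import pandas as pd

# Hypothetical stand-in for runtime_per_rating_all_genres (values are invented)
runtime_per_rating = pd.Series({"Average": 94.2, "Bad": 88.7, "Good": 101.5})

# Assumed pairing of rating category to color; not confirmed by the notebook
category_colors = {"Good": "forestgreen", "Average": "gold", "Bad": "crimson"}

# Look up each bar's color by its category label instead of by its position
bar_colors = [category_colors[category] for category in runtime_per_rating.index]
print(bar_colors)  # ['gold', 'crimson', 'forestgreen']

# The pairing survives a reordering of the data
reordered = runtime_per_rating.reindex(["Good", "Average", "Bad"])
reordered_colors = [category_colors[category] for category in reordered.index]
print(reordered_colors)  # ['forestgreen', 'gold', 'crimson']
```

The resulting list can be passed to `go.Bar` as `marker=dict(color=bar_colors)` in place of the hard-coded list.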

In [849]:
#fig = go.Figure(go.Bar(x=runtime_per_rating_all_genres.index, y=runtime_per_rating_all_genres.values, name = 'All Genres'))
#fig.add_trace(go.Bar(x=runtime_per_rating_action.index, y=runtime_per_rating_action.values, name='Action'))
#fig.add_trace(go.Bar(x=runtime_per_rating_adventure.index, y=runtime_per_rating_adventure.values, name='Adventure'))

#fig.update_layout(barmode='group')
#fig.update_xaxes(categoryorder='array', categoryarray= ['Good','Average','Bad'])
#fig.show()

### Insight 3: Directors of the Most Successful Movies

In [850]:
#Making the subplots with their respective titles
fig = make_subplots(rows=1, cols=2, subplot_titles= ['Mean Revenue From Movies in 2010-2018',
                                                     'Revenue From A Single Movie in 2010-2018'])

#Adding the appropriate x and y values to the first subplot
fig.add_trace(
    go.Bar(
        x=most_successful_avg_rev['primary_name'],
        y=most_successful_avg_rev['total_revenue'],
        hoverlabel=dict(namelength=0)
    ),
    row=1, col=1
)

#Adding the appropriate x and y values to the second subplot
fig.add_trace(
    go.Bar(
        x=most_successful_single_rev['primary_name'],
        y=most_successful_single_rev['total_revenue'],
        hoverlabel=dict(namelength=0)
    ),
    row=1, col=2
)

#Editing the layout of the subplots
fig.update_layout(showlegend = False)

#Creating the x axis titles for each subplot
fig.update_xaxes(title_text="Director", row=1, col=1)
fig.update_xaxes(title_text="Director", row=1, col=2)


#Creating the y axis titles for each subplot and standardizing the range of values
fig.update_yaxes(title_text="Revenue", row=1, col=1, range=[0, 2300000000])
fig.update_yaxes(title_text="Revenue", row=1, col=2, range=[0, 2300000000])

fig.show()

In [851]:
#BEST DIRECTORS: Directors who appear in both graphs
best_directors

#Caveat: Joss Whedon and Adam Green each directed only one movie in this dataset for 2010-2018,
#so their mean revenue equals the revenue of that single movie, which is why it is so high

#Sam Mendes, Michael Bay, Lee Unkrich, Pierre Coffin, Anthony Russo, and Christopher Nolan
#each directed one of the 15 highest-revenue movies in this dataset
#and also directed other successful films in 2010-2018,
#which places them among the top 15 directors by mean revenue as well

['Joss Whedon',
 'Adam Green',
 'Sam Mendes',
 'Michael Bay',
 'Lee Unkrich',
 'Pierre Coffin',
 'Anthony Russo',
 'Christopher Nolan']
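
As a minimal sketch of how a list like `best_directors` could be derived, the example below keeps the names that occur in both top lists. The two small DataFrames are hypothetical placeholders for `most_successful_avg_rev` and `most_successful_single_rev`; only the `primary_name` column name is taken from the cells above, and this is not necessarily how `best_directors` was actually built.

```python
import pandas as pd

# Hypothetical top lists (placeholder names, not real results)
top_by_mean = pd.DataFrame({"primary_name": ["Director A", "Director B", "Director C"]})
top_by_single = pd.DataFrame({"primary_name": ["Director C", "Director D", "Director A"]})

# Keep the names from the first list that also occur in the second,
# preserving the order of the first list
in_both = top_by_mean["primary_name"].isin(top_by_single["primary_name"])
overlap = top_by_mean.loc[in_both, "primary_name"].tolist()
print(overlap)  # ['Director A', 'Director C']
```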

### Insight 4: Actors and Actresses from the Most Successful Movies

In [852]:
#Making the subplots with their respective titles
fig = make_subplots(rows=1, cols=2, subplot_titles= ['Mean Revenue From Movies in 2010-2018',
                                                     'Revenue From A Single Movie in 2010-2018'])

#Adding the appropriate x and y values to the first subplot
fig.add_trace(
    go.Bar(
        x=best_actors_avg_rev['primary_name'],
        y=best_actors_avg_rev['total_revenue'],
        hoverlabel=dict(namelength=0)
    ),
    row=1, col=1
)

#Adding the appropriate x and y values to the second subplot
fig.add_trace(
    go.Bar(
        x=best_actors_single_rev['primary_name'],
        y=best_actors_single_rev['total_revenue'],
        hoverlabel=dict(namelength=0)
    ),
    row=1, col=2
)

#Editing the layout of the subplots
fig.update_layout(showlegend = False)

#Creating the x axis titles for each subplot
fig.update_xaxes(title_text="Actor", row=1, col=1)
fig.update_xaxes(title_text="Actor", row=1, col=2)


#Creating the y axis titles for each subplot and standardizing the range of values
fig.update_yaxes(title_text="Revenue", row=1, col=1, range=[0, 2600000000])
fig.update_yaxes(title_text="Revenue", row=1, col=2, range=[0, 2600000000])

fig.show()

In [853]:
#BEST ACTORS: Actors who appear in both graphs
best_actors

#Caveat: Daisy Ridley, Rafe Spall, Jason Momoa, Alan Tudyk, and Ed O'Neill
#each appear in only one movie in this dataset for 2010-2018,
#so their mean revenue equals the revenue of that single movie, which is why it is so high

#Sandra Bullock, Javier Bardem, and Chris Evans
#each appeared in one of the 15 highest-revenue movies in this dataset
#and also appeared in other successful films in 2010-2018,
#which places them among the top 15 actors by mean revenue as well

['Daisy Ridley',
 'Rafe Spall',
 'Jason Momoa',
 'Alan Tudyk',
 "Ed O'Neill",
 'Sandra Bullock',
 'Javier Bardem',
 'Chris Evans']
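
The single-movie caveat noted above can be checked directly by counting films per person next to the mean. The sketch below uses invented names and revenue figures; `credits` and `min_films` are hypothetical, and only the `primary_name` and `total_revenue` column names come from the cells above.

```python
import pandas as pd

# Hypothetical one-row-per-credit table (names and revenues are invented)
credits = pd.DataFrame({
    "primary_name": ["Actor A", "Actor B", "Actor B", "Actor C", "Actor C", "Actor C"],
    "total_revenue": [900, 400, 600, 100, 200, 300],
})

# Mean revenue and number of films per person, highest mean first
summary = (
    credits.groupby("primary_name")["total_revenue"]
    .agg(mean_revenue="mean", num_films="count")
    .sort_values("mean_revenue", ascending=False)
)

# Keep only people with at least min_films credits so that a single hit
# cannot place someone at the top of the mean-revenue ranking
min_films = 2
multi_film = summary[summary["num_films"] >= min_films]
print(summary)
print(multi_film)
```

Here Actor A ranks first by mean revenue on the strength of one film and drops out once the threshold is applied, which mirrors the situation described in the comments above.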