# Exploratory Data Analysis of Disney Movies Datasets

## Introduction

The purpose of this analysis project is to determine which genre of Disney movie is the most successful by exploring the total gross revenue for each genre. As Disney continues to produce movies every year, this information can help Disney decide which genre of movie to produce next in order to obtain a high gross revenue.  

The dataset we will be working with was obtained from [data.world](https://data.world/kgarrett/disney-character-success-00-16). It contains information about Disney movies released between 1937 and 2016 and includes the following columns:

* `movie_title`: title of the movie
* `release_date`: movie release date
* `genre`: movie genre (musical, adventure, drama, etc.)
* `MPAA_rating`: movie rating (G, PG, PG-13, R, Not Rated)
* `total_gross`: total gross revenue (in US dollars)
* `inflation_adjusted_gross`: total gross revenue adjusted for inflation over the years (in US dollars)

## Methods & Results

First, we import all necessary libraries and functions, and format our files with Black.

In [1]:
import pandas as pd
import altair as alt
from replace_str import replace_str
!black final_project.ipynb;
!black replace_str.py;
!black test_replace_str.py;

reformatted final_project.ipynb
All done! ✨ 🍰 ✨
1 file reformatted.
reformatted replace_str.py
All done! ✨ 🍰 ✨
1 file reformatted.
All done! ✨ 🍰 ✨
1 file left unchanged.


Next, we read in and preview the raw data.

In [2]:
disney_data = pd.read_csv("data/disney_movies_total_gross.csv")
disney_data.head()

| | movie_title | release_date | genre | MPAA_rating | total_gross | inflation_adjusted_gross |
|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | Dec 21, 1937 | Musical | G | \$184,925,485 | \$5,228,953,251 |
| 1 | Pinocchio | Feb 9, 1940 | Adventure | G | \$84,300,000 | \$2,188,229,052 |
| 2 | Fantasia | Nov 13, 1940 | Musical | G | \$83,320,000 | \$2,187,090,808 |
| 3 | Song of the South | Nov 12, 1946 | Adventure | G | \$65,000,000 | \$1,078,510,579 |
| 4 | Cinderella | Feb 15, 1950 | Drama | G | \$85,000,000 | \$920,608,730 |


**Table 1: Raw Disney Data**

We can learn more about the dataset by using the `.info()` method.

In [3]:
disney_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   movie_title               579 non-null    object
 1   release_date              579 non-null    object
 2   genre                     562 non-null    object
 3   MPAA_rating               523 non-null    object
 4   total_gross               579 non-null    object
 5   inflation_adjusted_gross  579 non-null    object
dtypes: object(6)
memory usage: 27.3+ KB


We can see that every column has dtype `object`, which here means the values are stored as strings. The non-null counts also show that the `genre` and `MPAA_rating` columns contain missing (`NA`) values. Since we want to find which movie genre produced the highest revenue, we remove the rows with a missing `genre`. To clean our data, we will perform the following steps:

1. Drop rows with `NA` values in the `genre` column
2. Remove `$` and `,` from the `total_gross` & `inflation_adjusted_gross` columns
3. Change the `total_gross` & `inflation_adjusted_gross` columns to an integer dtype
4. Change the `release_date` column to `datetime` type
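
The missing-value counts mentioned above follow directly from the `.info()` output: the total number of entries minus each column's non-null count. A minimal sketch of that arithmetic, using only the figures printed by `.info()` (in the notebook itself, `disney_data.isna().sum()` should report the same counts):

```python
# Figures copied from the .info() output above
total_rows = 579
non_null_counts = {"genre": 562, "MPAA_rating": 523}

# Missing values per column = total rows - non-null rows
missing = {column: total_rows - count for column, count in non_null_counts.items()}
print(missing)
```

Only the rows with a missing `genre` are dropped in the next step; rows with a missing `MPAA_rating` are kept because the rating is not used in this analysis.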

In [4]:
# Keep only rows with a known genre; .copy() gives an independent frame for the edits below
disney_data = disney_data.dropna(subset=['genre']).copy()

In [5]:
replace_str(disney_data, 'total_gross', '$', '');
replace_str(disney_data, 'total_gross', ',', '');
replace_str(disney_data, 'inflation_adjusted_gross', '$', '');
replace_str(disney_data, 'inflation_adjusted_gross', ',', '');

In [6]:
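# Use an explicit 64-bit integer type: the largest adjusted gross (5,228,953,251)
# does not fit in a 32-bit integer, which 'int' can map to on some platforms.
disney_data['total_gross'] = disney_data['total_gross'].astype('int64')

disney_data['inflation_adjusted_gross'] = disney_data['inflation_adjusted_gross'].astype('int64')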

In [7]:
disney_data = disney_data.assign(release_date = pd.to_datetime(disney_data['release_date']))

In [8]:
disney_data.head()

| | movie_title | release_date | genre | MPAA_rating | total_gross | inflation_adjusted_gross |
|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | 1937-12-21 | Musical | G | 184925485 | 5228953251 |
| 1 | Pinocchio | 1940-02-09 | Adventure | G | 84300000 | 2188229052 |
| 2 | Fantasia | 1940-11-13 | Musical | G | 83320000 | 2187090808 |
| 3 | Song of the South | 1946-11-12 | Adventure | G | 65000000 | 1078510579 |
| 4 | Cinderella | 1950-02-15 | Drama | G | 85000000 | 920608730 |


**Table 2: Clean Disney Data**
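
The `replace_str` helper is imported from a local module whose source is not shown here. As a rough, pandas-only sketch of the same four cleaning steps (not necessarily how `replace_str` works internally), the example below uses a small toy frame: two rows copied from Table 1 plus one made-up row with a missing genre. The names `raw`, `clean`, and `money_cols` are illustrative assumptions.

```python
import pandas as pd

# Two rows copied from Table 1, plus one hypothetical row with a missing genre
raw = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio", "Hypothetical Movie"],
    "release_date": ["Dec 21, 1937", "Feb 9, 1940", "Jan 1, 2000"],
    "genre": ["Musical", "Adventure", None],
    "total_gross": ["$184,925,485", "$84,300,000", "$1,000"],
    "inflation_adjusted_gross": ["$5,228,953,251", "$2,188,229,052", "$1,500"],
})

money_cols = ["total_gross", "inflation_adjusted_gross"]

# Step 1: drop rows without a genre (copy so later assignments do not touch `raw`)
clean = raw.dropna(subset=["genre"]).copy()

# Steps 2 and 3: strip "$" and "," with a character-class regex, then cast to 64-bit integers
for col in money_cols:
    clean[col] = clean[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Step 4: parse the release dates
clean["release_date"] = pd.to_datetime(clean["release_date"])

print(clean.dtypes)
```

Passing `regex=True` with the character class `[$,]` removes both symbols in one call; inside a character class, `$` is treated as a literal character rather than an end-of-string anchor.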

Now that our data is clean, we can start to explore it a bit further. Let's take a look at how many movies there are for each genre with a bar plot.

In [9]:
genre_freq = alt.Chart(disney_data, width=500, height=300).mark_bar().encode(
    x=alt.X('genre:N', title="Genre"), 
    y=alt.Y('count():Q', title="Count") 
).properties(title="Figure 1: Frequency of Movie Genres")
genre_freq

Figure 1 shows that Comedy, Adventure and Drama are three of the most common genres. Based on this, we predict that one of these three genres will produce the highest gross revenue.
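
The counts behind a frequency plot like Figure 1 can also be inspected as a table. A minimal sketch, assuming a toy frame that holds only the genres of the five rows previewed in Table 2 (so the counts below do not reflect the full dataset):

```python
import pandas as pd

# Genres of the five rows shown in Table 2 (toy sample, not the full dataset)
sample = pd.DataFrame({"genre": ["Musical", "Adventure", "Musical", "Adventure", "Drama"]})

# Number of rows per genre, sorted from most to least frequent
genre_counts = sample["genre"].value_counts()
print(genre_counts)
```

Running `disney_data["genre"].value_counts()` on the cleaned data would give the exact counts that Figure 1 displays.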

We will now explore our other variable of interest: gross revenue. Since the dataset includes inflation adjusted gross revenue and the original gross revenue, we need to determine which one we want to use. First, we will create a new dataframe that has the mean gross and inflation adjusted gross values per year, by using `groupby()`.

In [10]:
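# Extract the release year, then average both revenue columns within each year
disney_data = disney_data.assign(year=disney_data['release_date'].dt.year)
year_groups = disney_data.groupby(by='year')
# Select the revenue columns explicitly: pandas 2.0 and later raise an error
# when .mean() is applied to groups that still contain string columns
disney_years = year_groups[['total_gross', 'inflation_adjusted_gross']].mean().reset_index()
disney_years.head()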

| | year | total_gross | inflation_adjusted_gross |
|---|---|---|---|
| 0 | 1937 | 184925485.0 | 5228953000.0 |
| 1 | 1940 | 83810000.0 | 2187660000.0 |
| 2 | 1946 | 65000000.0 | 1078511000.0 |
| 3 | 1950 | 85000000.0 | 920608700.0 |
| 4 | 1954 | 28200000.0 | 528280000.0 |


**Table 3: Mean Original and Inflation Adjusted Gross Revenue per Year**
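
As a quick check of how the per-year means arise, the sketch below repeats the grouping on the three earliest rows of Table 2 (the toy frame `sample` is an assumption made for illustration). The two 1940 releases, *Pinocchio* and *Fantasia*, are averaged into a single row.

```python
import pandas as pd

# Three rows copied from Table 2, with the release year written out
sample = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia"],
    "year": [1937, 1940, 1940],
    "total_gross": [184925485, 84300000, 83320000],
    "inflation_adjusted_gross": [5228953251, 2188229052, 2187090808],
})

revenue_cols = ["total_gross", "inflation_adjusted_gross"]

# Mean of each revenue column within each release year
yearly_means = sample.groupby("year")[revenue_cols].mean().reset_index()
print(yearly_means)
```

The 1940 row of this result should agree with the 1940 row of Table 3, since those two films are the only 1940 releases among the first rows of the data.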

Now we can examine the distribution for both the original gross revenue and inflation adjusted gross revenue variables.

In [11]:
original_gross_plot = alt.Chart(disney_years, width=500, height=300).mark_bar().encode(
    x=alt.X('year:N', title="Release Year", bin=alt.Bin(maxbins=20)), 
    y=alt.Y('total_gross:Q', title="Mean Gross Revenue") 
).properties(title="Figure 2: Distribution of Original Gross")
original_gross_plot

In [12]:
inflation_gross_plot = alt.Chart(disney_years, width=500, height=300).mark_bar().encode(
    x=alt.X('year:N', title="Release Year", bin=alt.Bin(maxbins=20)), 
    y=alt.Y('inflation_adjusted_gross:Q', title="Mean Inflation Adjusted Gross Revenue") 
).properties(title="Figure 3: Distribution of Inflation Adjusted Gross")
inflation_gross_plot 

We can see in Figures 2 & 3 that there are obvious differences between the original gross revenues and the inflation adjusted gross revenues. For example, the mean inflation adjusted gross revenue for the years 1935-1940 is much higher than the mean original gross revenue for the same period. Therefore, we will use the inflation adjusted gross revenue so that revenues from different years can be compared on an equal footing.
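
To give a sense of the size of this gap, the sketch below computes the ratio of inflation adjusted to original gross revenue for two early films, using the values shown in Table 2. It is only an illustration of the comparison, not part of the original notebook.

```python
# (total_gross, inflation_adjusted_gross) pairs copied from Table 2
films = {
    "Snow White and the Seven Dwarfs (1937)": (184_925_485, 5_228_953_251),
    "Cinderella (1950)": (85_000_000, 920_608_730),
}

# Ratio of adjusted to original gross for each film
ratios = {title: adjusted / original for title, (original, adjusted) in films.items()}

for title, ratio in ratios.items():
    print(f"{title}: adjusted gross is about {ratio:.1f}x the original")
```

The older the film, the larger this ratio tends to be, which is why comparing unadjusted revenues would understate the success of early releases.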

Now, we can group by genre to determine which genre has the highest mean inflation adjusted gross revenue.

In [13]:
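genre_groups = disney_data.groupby(by='genre')
# Average only the two revenue columns; selecting them before .mean() avoids
# errors on non-numeric columns in pandas 2.0 and later
disney_genre = genre_groups[['inflation_adjusted_gross', 'total_gross']].mean().reset_index().sort_values(
    by="inflation_adjusted_gross", ascending=False)
disney_genre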

| | genre | inflation_adjusted_gross | total_gross |
|---|---|---|---|
| 8 | Musical | 603597900.0 | 72330260.0 |
| 1 | Adventure | 190397400.0 | 127047100.0 |
| 0 | Action | 137473400.0 | 104614100.0 |
| 10 | Thriller/Suspense | 89653790.0 | 58616940.0 |
| 3 | Comedy | 84667730.0 | 44613290.0 |
| 9 | Romantic Comedy | 77777080.0 | 50095950.0 |
| 11 | Western | 73815710.0 | 51287350.0 |
| 6 | Drama | 71893020.0 | 36026080.0 |
| 4 | Concert/Performance | 57410840.0 | 51728230.0 |
| 2 | Black Comedy | 52243490.0 | 32514400.0 |


**Table 4: Mean Gross Revenue by Movie Genre**

Table 4 shows us that the Musical genre has the highest mean inflation adjusted gross revenue, while Adventure and Action are the runners-up. It is also worth noting that Adventure and Action have the highest mean original gross revenue.

We can demonstrate our findings with another bar plot.

In [14]:
genre_gross_plot = alt.Chart(disney_genre, width=500, height=300).mark_bar().encode(
    x=alt.X('genre:N', title="Genre", sort="y"), 
    y=alt.Y('inflation_adjusted_gross:Q', title="Mean Inflation Adjusted Gross Revenue") 
).properties(title="Figure 4: Gross Revenue by Genre")
genre_gross_plot 

Figure 4 demonstrates that Musicals have roughly three times the mean inflation adjusted gross revenue of the second highest genre, Adventure.
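
The size of this gap can be checked directly from the means displayed in Table 4. A small sketch of the arithmetic (the values are the rounded figures shown in the table):

```python
# Mean inflation adjusted gross revenue per genre, as displayed in Table 4
musical_mean = 603_597_900.0
adventure_mean = 190_397_400.0

# How many times larger the Musical mean is than the Adventure mean
ratio = musical_mean / adventure_mean
print(f"Musical / Adventure = {ratio:.2f}")
```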

Finally, we can display the Disney movies in the Musical genre to get an idea of which movies contributed to the high gross revenue.

In [15]:
disney_data[disney_data['genre'] == 'Musical'].sort_values(
    by="inflation_adjusted_gross", ascending=False)

| | movie_title | release_date | genre | MPAA_rating | total_gross | inflation_adjusted_gross | year |
|---|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | 1937-12-21 | Musical | G | 184925485 | 5228953251 | 1937 |
| 2 | Fantasia | 1940-11-13 | Musical | G | 83320000 | 2187090808 | 1940 |
| 13 | The Jungle Book | 1967-10-18 | Musical | Not Rated | 141843000 | 789612346 | 1967 |
| 114 | Beauty and the Beast | 1991-11-13 | Musical | G | 218951625 | 363017667 | 1991 |
| 15 | The Aristocats | 1970-04-24 | Musical | G | 55675257 | 255161499 | 1970 |
| 553 | Into the Woods | 2014-12-25 | Musical | PG | 128002372 | 130894237 | 2014 |
| 10 | Babes in Toyland | 1961-12-14 | Musical | G | 10218316 | 124841160 | 1961 |
| 474 | High School Musical 3: Senior Year | 2008-10-24 | Musical | G | 90559416 | 106308538 | 2008 |
| 161 | The Nightmare Before Christmas | 1993-10-13 | Musical | PG | 50408318 | 100026637 | 1993 |
| 321 | Fantasia 2000 (IMAX) | 2000-01-01 | Musical | G | 60507228 | 94852354 | 2000 |


**Table 5: Disney Musicals**

Table 5 shows that *Snow White and the Seven Dwarfs* had the highest inflation adjusted gross revenue out of all of the Disney Musicals.

## Discussion

This analysis demonstrates that the Musical genre has the highest mean inflation adjusted gross revenue of all the movie genres. Therefore, we can conclude that Musicals are the most successful Disney movie genre, followed by Adventure and Action.

We originally predicted that Comedy, Adventure or Drama would have the highest gross revenue since they were the most common Disney movie genres. Musicals were one of the less common genres, so we were not expecting them to have such a high gross revenue. However, Table 5 lists the musicals with the highest inflation adjusted gross revenue, including *Snow White and the Seven Dwarfs*, *Fantasia*, *The Jungle Book* and *Beauty and the Beast*. These are very well known stories and fairytales, so it makes sense that these movies had very high revenues.

These findings can help Disney choose the genre of their next movie. Since we determined that musicals produced the highest mean gross revenue, releasing another musical could potentially bring in revenue comparable to that of some past musicals. Furthermore, Adventure and Action had the second and third highest mean inflation adjusted revenues, so these genres are also strong options.

This analysis also opens the door to other questions: which directors and actors contributed to the movies with the highest revenue? Is there a difference in revenue between animated movies and live action movies? Answering these questions could help Disney decide who to cast and what kind of movie to produce next in order to obtain a high gross revenue.

## References

The data used in this analysis project was obtained from:

https://data.world/kgarrett/disney-character-success-00-16