# Exploratory Data Analysis of the Disney Datasets - What Makes Disney the Most Money?

#### Marina Galvao

## Introduction

### Questions of Interest

In my analysis, I will be investigating what makes a Disney movie successful. 

My questions are:

1) On average, what genre of movies tends to have the highest inflation-adjusted gross? 
2) Do the movies with the highest inflation-adjusted gross all come from the same genre? 
3) Do the movies with the highest inflation-adjusted gross all come from the same director who has an in-depth understanding of the features that make for our most beloved characters?
4) Is the timeline under which a movie is released related to its inflation-adjusted gross?

This is interesting because it helps us understand which elements influence box office performance and it gives us an idea as to what degree these elements impact a movie's success. 

### Dataset Description

The dataset we are working with is from [Kaggle](https://www.kaggle.com/datasets/maricinnamon/walt-disney-character-dataset?select=disney-characters.csv%29) and includes five tables containing information about Disney. We will be working with the disney_movies_total_gross table and the disney-director table.

- disney_movies_total_gross.csv: Listed here in the order of the columns, the file contains a column for the movie title, the movie release date, the movie genre, the movie MPAA rating, total gross value earned from the movie, and the inflation-adjusted total gross value earned (which we will be using as this measure is comparable across timelines).

- disney-director.csv: Listed here in the order of the columns, the file contains a column for the movie title, and a column for the director's name.

### Methods and Results

In [1]:
# First, let's import the libraries needed for the analysis

import pandas as pd
import altair as alt

# Now, let's import the required files

disney_gross = pd.read_csv('data/disney_movies_total_gross.csv')
disney_director = pd.read_csv('data/disney-director.csv')

#### Table 1. Disney Inflation Adjusted Gross File, First 5 Rows

In [2]:
# Let's have a look at each file, starting with the disney_gross file's first 5 rows

disney_gross.head()

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,"Dec 21, 1937",Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,"Feb 9, 1940",Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,"Nov 13, 1940",Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,"Nov 12, 1946",Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,"Feb 15, 1950",Drama,G,"$85,000,000","$920,608,730"


#### Table 2. Summary of Disney Director File, First 5 Rows

In [3]:
# Now the disney_director file

disney_director.head()

Unnamed: 0,name,director
0,Snow White and the Seven Dwarfs,David Hand
1,Pinocchio,Ben Sharpsteen
2,Fantasia,full credits
3,Dumbo,Ben Sharpsteen
4,Bambi,David Hand


Here, we can observe that the two files name the movie title column differently: in the disney_gross file it is called "movie_title", while in the disney_director file it is called "name".

In [4]:
# Now let's get some info about each file

# Starting with the disney_gross dataframe

disney_gross.info()
disney_gross.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   movie_title               579 non-null    object
 1   release_date              579 non-null    object
 2   genre                     562 non-null    object
 3   MPAA_rating               523 non-null    object
 4   total_gross               579 non-null    object
 5   inflation_adjusted_gross  579 non-null    object
dtypes: object(6)
memory usage: 27.3+ KB


movie_title                 object
release_date                object
genre                       object
MPAA_rating                 object
total_gross                 object
inflation_adjusted_gross    object
dtype: object

The result shows us the column types in this dataframe are all objects, and that some columns have null values (genre and MPAA_rating).

In [5]:
# Now the disney_director dataframe

disney_director.info()
disney_director.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      56 non-null     object
 1   director  56 non-null     object
dtypes: object(2)
memory usage: 1.0+ KB


name        object
director    object
dtype: object

Here we see we also have a dataframe with objects as the column types, but unlike the previous one, we do not see any null values.

To start off our analysis, I want to manipulate the data to help me answer the first question previously presented:
1) On average, what genre of movies tends to have the highest inflation-adjusted gross? 

To do so, I need to group the disney_gross file by genre, and calculate the mean of the inflation_adjusted_gross for each genre. This way I can see which genre is most successful on average. 

Since we saw some nan values are present in the genre column of disney_gross, we will need to get rid of these in order to answer our question.

#### Table 3. Summary of Genre and Mean Inflation-Adjusted Gross

In [6]:
# Dropping nan values from disney_gross genre column as we don't want to include missing values
disney_gross = disney_gross.dropna(subset=['genre'])

# We first need to convert 'inflation_adjusted_gross' to numeric; we remove dollar signs and commas, then convert to float
# (the raw string r'\$' avoids Python's invalid escape sequence warning for '\$')

disney_gross['inflation_adjusted_gross'] = disney_gross['inflation_adjusted_gross'].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Now, let's group by genre and calculate the mean inflation_adjusted_gross for each group, sorting from highest to lowest
# We will also reset the index and rename the column from 'inflation_adjusted_gross' to 'mean_inflation_adjusted_gross' for better clarity

grouped_by_genre = disney_gross.groupby('genre')['inflation_adjusted_gross'].mean()
grouped_by_genre_df = grouped_by_genre.sort_values(ascending=False).reset_index()
grouped_by_genre_df = grouped_by_genre_df.rename(columns={'inflation_adjusted_gross': 'mean_inflation_adjusted_gross'})
grouped_by_genre_df

Unnamed: 0,genre,mean_inflation_adjusted_gross
0,Musical,603597900.0
1,Adventure,190397400.0
2,Action,137473400.0
3,Thriller/Suspense,89653790.0
4,Comedy,84667730.0
5,Romantic Comedy,77777080.0
6,Western,73815710.0
7,Drama,71893020.0
8,Concert/Performance,57410840.0
9,Black Comedy,52243490.0


From the table above, we can see the genre 'Musical' has the highest mean inflation-adjusted gross. Let's create a chart so we can have a better look.

In [7]:
# Let's use Altair here and create a bar graph with Movie Genre in the x-axis and Mean Inflation-Adjusted Gross in $ in the y-axis

mean_inflation_chart = alt.Chart(grouped_by_genre_df).mark_bar().encode(
    x=alt.X(
        'genre:N',
        title='Movie Genre',
        sort='-y'
    ),
    y=alt.Y(
        'mean_inflation_adjusted_gross:Q',
        title='Mean Inflation-Adjusted Gross in $'
    )
).properties(
    width=500,
    height=300,
    title='Disney Movie Genre and Mean Inflation-Adjusted Gross'
)

mean_inflation_chart

#### Figure 1. Disney Movie Genre and Mean Inflation-Adjusted Gross

There is a clear gap between musicals and every other genre: musicals have by far the highest mean inflation-adjusted gross (about $604 million, compared with about $190 million for adventure in second place). The remaining genres follow in the order action, thriller/suspense, comedy, romantic comedy, western, drama, concert/performance, black comedy, horror, and documentary.

Now let's tackle our second question:

2) Do the movies with the highest inflation-adjusted gross all come from the same genre? 

To understand this, we can order the disney_gross data from highest inflation-adjusted gross to lowest, and see whether the highest-grossing movies tend to be musicals.

#### Table 4. Top 10 Movies with Highest Inflation-Adjusted Gross

In [8]:
# Let's do this for the top 10 movies, sorting by inflation-adjusted gross

sorted_by_inflation = disney_gross.sort_values(by='inflation_adjusted_gross', ascending=False).head(10)
sorted_by_inflation

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,"Dec 21, 1937",Musical,G,"$184,925,485",5228953000.0
1,Pinocchio,"Feb 9, 1940",Adventure,G,"$84,300,000",2188229000.0
2,Fantasia,"Nov 13, 1940",Musical,G,"$83,320,000",2187091000.0
8,101 Dalmatians,"Jan 25, 1961",Comedy,G,"$153,000,000",1362871000.0
6,Lady and the Tramp,"Jun 22, 1955",Drama,G,"$93,600,000",1236036000.0
3,Song of the South,"Nov 12, 1946",Adventure,G,"$65,000,000",1078511000.0
564,Star Wars Ep. VII: The Force Awakens,"Dec 18, 2015",Adventure,PG-13,"$936,662,225",936662200.0
4,Cinderella,"Feb 15, 1950",Drama,G,"$85,000,000",920608700.0
13,The Jungle Book,"Oct 18, 1967",Musical,Not Rated,"$141,843,000",789612300.0
179,The Lion King,"Jun 15, 1994",Adventure,G,"$422,780,140",761640900.0


We can see that out of the ten rows, only three are musicals (the 1st, 3rd, and 9th rows). Since the top ten is not dominated by musicals, it may be that a few very large values, such as the first and third movies, are pulling up the mean for musicals, and that being a musical is not by itself the key to a movie's success.
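One way to check whether a few large values are pulling up a genre's mean is to compare the mean with the median, which is much less sensitive to extreme values. The sketch below is illustrative only: it uses just the ten rows shown in Table 4 (not the full disney_gross data), and the names top_ten and summary are introduced here for the example.

```python
import pandas as pd

# Genre and inflation-adjusted gross copied from the ten rows of Table 4.
# This is only a subset of the data, so the numbers are illustrative.
top_ten = pd.DataFrame({
    "genre": ["Musical", "Adventure", "Musical", "Comedy", "Drama",
              "Adventure", "Adventure", "Drama", "Musical", "Adventure"],
    "inflation_adjusted_gross": [
        5228953000.0, 2188229000.0, 2187091000.0, 1362871000.0, 1236036000.0,
        1078511000.0, 936662200.0, 920608700.0, 789612300.0, 761640900.0,
    ],
})

# Mean, median, and number of movies per genre, sorted by mean
summary = (
    top_ten.groupby("genre")["inflation_adjusted_gross"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Within these ten rows, the mean for musicals is well above the median, which is what we would expect if Snow White and the Seven Dwarfs is pulling the average up. Running the same aggregation on the full disney_gross dataframe would show whether this also holds for the whole dataset.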

In [9]:
# Let's graph this data as a bar graph to see what this might show us 

sorted_by_inflation_chart = alt.Chart(sorted_by_inflation).mark_bar().encode(
    x=alt.X('movie_title:N', title='Movie Title', sort='-y'),
    y=alt.Y('inflation_adjusted_gross:Q', title='Inflation-Adjusted Gross in $')
).properties(width=500, height=300, title='Disney Movie and Inflation-Adjusted Gross')

sorted_by_inflation_chart

#### Figure 2. Disney Movie and Inflation-Adjusted Gross

The graph shows Snow White and the Seven Dwarfs with the highest value (about $5.23 billion), followed by Pinocchio, Fantasia, 101 Dalmatians, Lady and the Tramp, Song of the South, Star Wars Ep. VII: The Force Awakens, Cinderella, The Jungle Book, and The Lion King. Snow White's value is more than twice that of Pinocchio in second place, which shows just how large the gap is between Snow White and the Seven Dwarfs and the other movies.

Now let's tackle our third question: 

3) Do the movies with the highest inflation-adjusted gross all come from the same director who has an in-depth understanding of the features that make for our most beloved characters?

#### Table 5. Disney Director Table, Column Renamed

In [10]:
# In order to answer this, let's merge the disney_director dataset with the disney_gross dataset using the name of the movies
# In doing so, we can see if the movies with the highest inflation-adjusted gross belong to the same directors 

# But first, let's re-name the column 'name' to 'movie_title' in the disney_director dataframe, 
# so that we can merge the data without any issues

disney_director = disney_director.rename(columns={'name': 'movie_title'})

disney_director

Unnamed: 0,movie_title,director
0,Snow White and the Seven Dwarfs,David Hand
1,Pinocchio,Ben Sharpsteen
2,Fantasia,full credits
3,Dumbo,Ben Sharpsteen
4,Bambi,David Hand
5,Saludos Amigos,Jack Kinney
6,The Three Caballeros,Norman Ferguson
7,Make Mine Music,Jack Kinney
8,Fun and Fancy Free,Jack Kinney
9,Melody Time,Clyde Geronimi


#### Table 6. Data for Inflation-Adjusted Gross and Movie Director Merged

In [11]:
# Now, we can go ahead and merge

combined_df = pd.merge(disney_director, disney_gross, on='movie_title', how='inner').sort_values(by='inflation_adjusted_gross', ascending=False)
combined_df

Unnamed: 0,movie_title,director,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,David Hand,"Dec 21, 1937",Musical,G,"$184,925,485",5228953000.0
1,Pinocchio,Ben Sharpsteen,"Feb 9, 1940",Adventure,G,"$84,300,000",2188229000.0
2,Fantasia,full credits,"Nov 13, 1940",Musical,G,"$83,320,000",2187091000.0
8,101 Dalmatians,Wolfgang Reitherman,"Jan 25, 1961",Comedy,G,"$153,000,000",1362871000.0
6,Lady and the Tramp,Hamilton Luske,"Jun 22, 1955",Drama,G,"$93,600,000",1236036000.0
3,Cinderella,Wilfred Jackson,"Feb 15, 1950",Drama,G,"$85,000,000",920608700.0
11,The Jungle Book,Wolfgang Reitherman,"Oct 18, 1967",Musical,Not Rated,"$141,843,000",789612300.0
24,The Lion King,Roger Allers,"Jun 15, 1994",Adventure,G,"$422,780,140",761640900.0
23,Aladdin,Ron Clements,"Nov 11, 1992",Comedy,G,"$217,350,219",441969200.0
44,Frozen,Chris Buck,"Nov 22, 2013",Adventure,PG,"$400,738,009",414997200.0


There doesn't seem to be an obvious pattern of a specific director producing the highest-grossing movies. In fact, we only see one director repeated in the top 10 movies (Wolfgang Reitherman).

Let's use a function previously created to look into this further. The function will take in a dataframe, a column to group by, and a column to sum. After running it, it will return a dataframe with the grouped by column and the summed column. In this case, I want to use it to group by the 'director', and to sum the 'inflation_adjusted_gross'. This will help us see which directors contribute most to inflation-adjusted gross.
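The contents of script1 are not shown in this notebook, so the following is only a minimal sketch of what a function with this behaviour could look like. The implementation, the column check, and the descending sort are assumptions based on the description above and on the shape of Table 7; the real group_and_sum may differ. The example rows are taken from Table 6.

```python
import pandas as pd


def group_and_sum(df, group_col, sum_col):
    """Hypothetical stand-in for script1.group_and_sum (an assumption, not the real script).

    Groups df by group_col, sums sum_col within each group, and returns a
    dataframe indexed by group_col, sorted from highest to lowest total.
    """
    for col in (group_col, sum_col):
        if col not in df.columns:
            raise KeyError(f"Column '{col}' not found in dataframe")
    return (
        df.groupby(group_col)[[sum_col]]
        .sum()
        .sort_values(sum_col, ascending=False)
    )


# Four rows copied from Table 6, only to demonstrate the call
example = pd.DataFrame({
    "director": ["David Hand", "Ben Sharpsteen",
                 "Wolfgang Reitherman", "Wolfgang Reitherman"],
    "inflation_adjusted_gross": [5228953000.0, 2188229000.0,
                                 1362871000.0, 789612300.0],
})
result = group_and_sum(example, "director", "inflation_adjusted_gross")
print(result)
```

Because this example only contains four rows, the totals it prints will not match Table 7, which is computed from the full merged dataframe.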

In [12]:
# Let's start by importing the script

from script1 import group_and_sum

# Let's run black on our script to make sure it is up to standards 

!black script1.py

All done! ✨ 🍰 ✨
1 file left unchanged.


#### Table 7. Data Grouped by Director with Inflation-Adjusted Gross Summed

In [13]:
# Now let's use the script and ask it to group by the 'director' column and sum the 'inflation_adjusted_gross' column

director_total_gross = group_and_sum(combined_df, 'director', 'inflation_adjusted_gross')
director_total_gross

director,inflation_adjusted_gross
David Hand,5228953000.0
Wolfgang Reitherman,3432920000.0
Ben Sharpsteen,2188229000.0
full credits,2187091000.0
Ron Clements,1318950000.0
Hamilton Luske,1236036000.0
Wilfred Jackson,1121760000.0
Roger Allers,761640900.0
Chris Buck,698897400.0
Gary Trousdale,679194600.0


We see that although David Hand only shows up once in Table 6, he is still the director with the highest total inflation-adjusted gross because he directed Snow White and the Seven Dwarfs.
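Table 5 lists David Hand as the director of both Snow White and the Seven Dwarfs and Bambi, yet his total in Table 7 equals the gross of Snow White alone, which suggests that Bambi did not find a match in the inner merge. A left merge with indicator=True is one way to list the titles that fail to match. The sketch below uses only the first five rows of each file as shown in Tables 1 and 2, so it demonstrates the technique rather than the result for the full files; the names directors, gross, checked, and unmatched are introduced here for the example.

```python
import pandas as pd

# First five rows of the director file (Table 2), with 'name' already renamed
directors = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio",
                    "Fantasia", "Dumbo", "Bambi"],
    "director": ["David Hand", "Ben Sharpsteen", "full credits",
                 "Ben Sharpsteen", "David Hand"],
})

# First five rows of the gross file (Table 1), gross already converted to float
gross = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio",
                    "Fantasia", "Song of the South", "Cinderella"],
    "inflation_adjusted_gross": [5228953251.0, 2188229052.0, 2187090808.0,
                                 1078510579.0, 920608730.0],
})

# how='left' keeps every director row; the '_merge' column records whether
# a matching title was found in the gross data
checked = pd.merge(directors, gross, on="movie_title", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", ["movie_title", "director"]]
print(unmatched)
```

Applied to the full disney_director and disney_gross dataframes, the same check would show exactly which directed movies are dropped by the inner merge and therefore missing from Tables 6 and 7.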

In [14]:
# Now let's quickly test script2 to make sure it doesn't have any failures

!pytest script2.py

platform linux -- Python 3.8.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/prog-python-ds-students/release/final_project
plugins: anyio-3.2.1, dash-1.20.0
collected 2 items

script2.py ..                                                            [100%]



What if the timeline under which the movie is released is related to its success? Maybe over the years, Disney movies have gained or lost popularity. Let's test this out and try to answer our last question:

4) Is the timeline under which a movie is released related to its inflation-adjusted gross?

In [15]:
# First, let's separate the 'release_date' column so we can have the year be its own column

dates = (combined_df['release_date'].str.split(',', expand=True)
                           .rename(columns = {0:'Date',
                                              1:'Year'}))

# Now let's merge the new dataframe with the separated year column to our 'combined_df' previously created, sorting it by year

dates_concatenated = pd.concat([combined_df.drop('release_date', axis=1), dates], axis=1).sort_values(by='Year')
dates_concatenated

# Let's create an Altair scatterplot to see what this tells us

scatter_plot_years = alt.Chart(dates_concatenated).mark_point().encode(
    x=alt.X(
        'Year:O', 
        title='Movie Release Year', 
        sort='ascending'
    ), 
    y=alt.Y(
        'inflation_adjusted_gross:Q', 
        title='Inflation-Adjusted Gross in $'
    )
).properties(
    width=500, 
    height=500, 
    title='Movie Inflation-Adjusted Gross and Release Year'
)

scatter_plot_years

#### Figure 3. Movie Inflation-Adjusted Gross and Release Year

Figure 3 shows a loose downward trend over time, apart from a few outliers. The outlier furthest from the general trend is Snow White and the Seven Dwarfs, released in 1937 with an inflation-adjusted gross of 5.228953e+09. The second outlier is Pinocchio, released in 1940 with an inflation-adjusted gross of 2.188229e+09. Once again, the data shows that Snow White and the Seven Dwarfs has a very different inflation-adjusted gross from the rest of the movies. The outliers become less extreme over time, and from 1995 onwards all of the movies fall below the first gridline on the plot (between 0 and 500,000,000).
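Splitting release_date on the comma leaves the year as a string with a leading space. As an alternative, the date can be parsed with pd.to_datetime and the year extracted as an integer, which sorts numerically. This is a sketch under the assumption that every release_date follows the "Dec 21, 1937" pattern seen in Table 1; the sample dataframe below holds only those five rows.

```python
import pandas as pd

# Release dates copied from the first five rows of Table 1
sample = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
                    "Song of the South", "Cinderella"],
    "release_date": ["Dec 21, 1937", "Feb 9, 1940", "Nov 13, 1940",
                     "Nov 12, 1946", "Feb 15, 1950"],
})

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(sample["release_date"], format="%b %d, %Y")

# Integer year column, with no leading whitespace to worry about
sample["Year"] = parsed.dt.year
print(sample[["movie_title", "Year"]])
```

With an integer Year column, the scatterplot could keep the ordinal encoding or switch to a quantitative or temporal x-axis, which would space the points according to the actual gaps between release years.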

## Discussion

Let's tackle each of the questions previously mentioned:

1) On average, what genre of movies tends to have the highest inflation-adjusted gross? 

My findings indicate that on average, musicals tend to make the most money. The analysis shows that the musical Snow White and the Seven Dwarfs is the movie with the highest inflation-adjusted gross of all time. It is important to mention that this movie is likely a significant reason why musicals have the highest gross on average. Snow White and the Seven Dwarfs stands far apart from the other movies in terms of inflation-adjusted gross (see Figures 2 and 3). Its inflation-adjusted gross is 5.228953e+09, which is more than twice that of the second highest-grossing movie (Pinocchio, with an inflation-adjusted gross of 2.188229e+09).


2) Do the movies with the highest inflation-adjusted gross all come from the same genre? 

The analysis showed that the movies with the highest inflation-adjusted gross are not all musicals. In fact, only three out of the top ten movies with the highest inflation-adjusted gross are musicals (Table 4).


3) Do the movies with the highest inflation-adjusted gross all come from the same director who has an in-depth understanding of the features that make for our most beloved characters?

The analysis suggests that this is not the case. No single director dominates the movies with the highest inflation-adjusted gross (Table 6); only Wolfgang Reitherman appears more than once in the top ten. However, Table 7 showed that even though David Hand appears only once in the merged data (with a single matched movie), he is still the director with the highest total inflation-adjusted gross because he directed Snow White and the Seven Dwarfs.


4) Is the timeline under which a movie is released related to its inflation-adjusted gross?

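A small number of outlier movies have a very high inflation-adjusted gross in comparison to the rest, and all of them were released before 1995 (Figure 3). Release year therefore does appear to be related to inflation-adjusted gross, with the highest values concentrated among the earliest releases.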


#### What I expected

I expected musicals to be the most popular, as this category tends to be the one most associated with successful Disney films (such as Frozen or The Lion King). Alongside this, I expected the movies with the highest inflation-adjusted gross to all be musicals. I expected to see the same director come up repeatedly in the list of successful films (such as Tim Burton, who has directed many incredibly successful movies), so I found it interesting that there are numerous directors and no specific pattern. I also expected that the year a movie was released would not have any relation to its inflation-adjusted gross. However, Figure 3 shows inflation-adjusted gross decreasing over time, with some outliers displaying very high inflation-adjusted gross in the past.


#### What my findings mean

My findings show that although musicals have the highest gross on average, they are not the only successful genre. This implies that movies in other genres have other qualities (perhaps the storyline or different animation styles) that are also very appealing to the public. The variety in directors implies that Disney does not tend to stick to one director for numerous movies, and that there is diversity in leadership at Disney. The timeline's relationship with inflation-adjusted gross could be caused by the fact that classic movies (such as Snow White and the Seven Dwarfs) remained popular over time, adding to their cumulative inflation-adjusted gross. Moreover, the growth of other platforms over time (e.g. Netflix) could explain decreased interest in Disney.


#### What I would like to learn about

Having performed this analysis, I would like to learn more about Disney's marketing strategies and how these may influence how successful movie releases are. I would also like to better understand how measuring inflation-adjusted gross might have changed over time, as this may account for the changes seen in Figure 3. 

## References

[Kaggle Walt Disney Dataset](https://www.kaggle.com/datasets/maricinnamon/walt-disney-character-dataset?select=disney-characters.csv%29) - last updated on Kaggle three years ago.