# The most popular genre of Disney movies (1937-2016)

# Introduction

## Question(s) of interest
In this analysis, I investigate a question about the collection of Disney datasets.
I am interested in finding out which genre has the highest total gross. Each movie has its own genre, so it would be interesting to see which genre is the most popular. I expect the **Adventure** genre to be the most popular.

## Dataset description 


Walt Disney Studios is the foundation on which The Walt Disney Company was built. The Studios has produced more than 600 films since its debut film, Snow White and the Seven Dwarfs, in 1937. While many of its films were big hits, some were not. The `disney_movies_total_gross.csv` dataset (used in this project) contains the movies released by Disney from 1937 to 2016. It holds 579 Disney movies with the following six attributes:

* *movie_title*
* *release_date*
* *genre*
* *MPAA_rating*
* *total_gross*
* *inflation_adjusted_gross*


[Image: Disney Movies 1937-2016 Gross Income]

# Methods and Results

Since I am interested in finding the most popular genre, I will use the `disney_movies_total_gross.csv` file, which contains information on movies and their gross.

Before going further, let us import all the packages needed in the project.

In [68]:
import pandas as pd
import numpy as np
import altair as alt

In [69]:
movies_gross = pd.read_csv('data/disney_movies_total_gross.csv')
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,"Dec 21, 1937",Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,"Feb 9, 1940",Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,"Nov 13, 1940",Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,"Nov 12, 1946",Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,"Feb 15, 1950",Drama,G,"$85,000,000","$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,"Sep 2, 2016",Drama,PG-13,"$12,545,979","$12,545,979"
575,Queen of Katwe,"Sep 23, 2016",Drama,PG,"$8,874,389","$8,874,389"
576,Doctor Strange,"Nov 4, 2016",Adventure,PG-13,"$232,532,923","$232,532,923"
577,Moana,"Nov 23, 2016",Adventure,PG,"$246,082,029","$246,082,029"


Let's look at the summary information of the dataframe.

In [70]:
movies_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   movie_title               579 non-null    object
 1   release_date              579 non-null    object
 2   genre                     562 non-null    object
 3   MPAA_rating               523 non-null    object
 4   total_gross               579 non-null    object
 5   inflation_adjusted_gross  579 non-null    object
dtypes: object(6)
memory usage: 27.3+ KB


Every **movie_title** has a **total_gross** amount, but some movies have a null value for their genre (and some for their MPAA rating). I will fill those missing values with the string "NaN".

In [71]:
movies_gross = movies_gross.fillna("NaN")
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,"Dec 21, 1937",Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,"Feb 9, 1940",Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,"Nov 13, 1940",Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,"Nov 12, 1946",Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,"Feb 15, 1950",Drama,G,"$85,000,000","$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,"Sep 2, 2016",Drama,PG-13,"$12,545,979","$12,545,979"
575,Queen of Katwe,"Sep 23, 2016",Drama,PG,"$8,874,389","$8,874,389"
576,Doctor Strange,"Nov 4, 2016",Adventure,PG-13,"$232,532,923","$232,532,923"
577,Moana,"Nov 23, 2016",Adventure,PG,"$246,082,029","$246,082,029"
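Note that `fillna("NaN")` stores the literal string `"NaN"` in every column that has gaps, not a real missing value. As a minimal sketch on a made-up three-row frame (the sample rows and the `"Unknown"` label are assumptions for illustration, not part of the dataset), the missing values can be counted first and then labeled in the `genre` column only:

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C"],
    "genre": ["Musical", None, "Drama"],
    "MPAA_rating": ["G", None, None],
})

# Count the missing values per column before changing anything.
missing_counts = sample.isna().sum()
print(missing_counts)

# Label missing genres explicitly and leave the other columns untouched.
sample["genre"] = sample["genre"].fillna("Unknown")
print(sample)
```

Restricting the fill to one column keeps real missing values in the other columns, so later calls such as `isna()` still find them.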


The info output shows that every column has the `object` dtype. For example, to check whether the **total_gross** values are numbers (int or float), we can run:

In [72]:
print(movies_gross['total_gross'].dtype)

object


It seems that, because of the **$** sign and the **,** separators, the values are read as strings (`object`). So we need to remove these characters.

In [73]:
movies_gross['total_gross'] = movies_gross['total_gross'].str.strip('$').str.replace(',', '')
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,"Dec 21, 1937",Musical,G,184925485,"$5,228,953,251"
1,Pinocchio,"Feb 9, 1940",Adventure,G,84300000,"$2,188,229,052"
2,Fantasia,"Nov 13, 1940",Musical,G,83320000,"$2,187,090,808"
3,Song of the South,"Nov 12, 1946",Adventure,G,65000000,"$1,078,510,579"
4,Cinderella,"Feb 15, 1950",Drama,G,85000000,"$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,"Sep 2, 2016",Drama,PG-13,12545979,"$12,545,979"
575,Queen of Katwe,"Sep 23, 2016",Drama,PG,8874389,"$8,874,389"
576,Doctor Strange,"Nov 4, 2016",Adventure,PG-13,232532923,"$232,532,923"
577,Moana,"Nov 23, 2016",Adventure,PG,246082029,"$246,082,029"


Let's check it again!

In [74]:
print(movies_gross['total_gross'].dtype)

object


The values are still strings, so we need one more step to convert them to float.

In [75]:
movies_gross['total_gross'] = movies_gross['total_gross'].astype('float')
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,"Dec 21, 1937",Musical,G,184925485.0,"$5,228,953,251"
1,Pinocchio,"Feb 9, 1940",Adventure,G,84300000.0,"$2,188,229,052"
2,Fantasia,"Nov 13, 1940",Musical,G,83320000.0,"$2,187,090,808"
3,Song of the South,"Nov 12, 1946",Adventure,G,65000000.0,"$1,078,510,579"
4,Cinderella,"Feb 15, 1950",Drama,G,85000000.0,"$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,"Sep 2, 2016",Drama,PG-13,12545979.0,"$12,545,979"
575,Queen of Katwe,"Sep 23, 2016",Drama,PG,8874389.0,"$8,874,389"
576,Doctor Strange,"Nov 4, 2016",Adventure,PG-13,232532923.0,"$232,532,923"
577,Moana,"Nov 23, 2016",Adventure,PG,246082029.0,"$246,082,029"


In [76]:
print(movies_gross['total_gross'].dtype)

float64
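The same cleaning also applies to **inflation_adjusted_gross**, which is still stored as text. A small sketch of how both steps could be wrapped in one helper (the function name `currency_to_float` is my own, and the two rows are copied from the table above):

```python
import pandas as pd

def currency_to_float(column: pd.Series) -> pd.Series:
    """Remove '$' and ',' from strings such as '$184,925,485' and cast to float."""
    return column.str.replace(r"[$,]", "", regex=True).astype(float)

sample = pd.DataFrame({
    "total_gross": ["$184,925,485", "$84,300,000"],
    "inflation_adjusted_gross": ["$5,228,953,251", "$2,188,229,052"],
})

# Apply the same conversion to both currency columns.
for name in ["total_gross", "inflation_adjusted_gross"]:
    sample[name] = currency_to_float(sample[name])

print(sample.dtypes)
```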


I will also change the type of the **release_date** column to datetime.

In [77]:
movies_gross['release_date'] = movies_gross['release_date'].str.replace(',', '')
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,Dec 21 1937,Musical,G,184925485.0,"$5,228,953,251"
1,Pinocchio,Feb 9 1940,Adventure,G,84300000.0,"$2,188,229,052"
2,Fantasia,Nov 13 1940,Musical,G,83320000.0,"$2,187,090,808"
3,Song of the South,Nov 12 1946,Adventure,G,65000000.0,"$1,078,510,579"
4,Cinderella,Feb 15 1950,Drama,G,85000000.0,"$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,Sep 2 2016,Drama,PG-13,12545979.0,"$12,545,979"
575,Queen of Katwe,Sep 23 2016,Drama,PG,8874389.0,"$8,874,389"
576,Doctor Strange,Nov 4 2016,Adventure,PG-13,232532923.0,"$232,532,923"
577,Moana,Nov 23 2016,Adventure,PG,246082029.0,"$246,082,029"


In [78]:
# pd.to_datetime parses the date strings; pandas 2.x rejects astype('datetime64') without a unit
movies_gross['release_date'] = pd.to_datetime(movies_gross['release_date'])
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485.0,"$5,228,953,251"
1,Pinocchio,1940-02-09,Adventure,G,84300000.0,"$2,188,229,052"
2,Fantasia,1940-11-13,Musical,G,83320000.0,"$2,187,090,808"
3,Song of the South,1946-11-12,Adventure,G,65000000.0,"$1,078,510,579"
4,Cinderella,1950-02-15,Drama,G,85000000.0,"$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,PG-13,12545979.0,"$12,545,979"
575,Queen of Katwe,2016-09-23,Drama,PG,8874389.0,"$8,874,389"
576,Doctor Strange,2016-11-04,Adventure,PG-13,232532923.0,"$232,532,923"
577,Moana,2016-11-23,Adventure,PG,246082029.0,"$246,082,029"


Let's check it together:

In [79]:
print(movies_gross['release_date'].dtype)

datetime64[ns]
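As a side note, the comma does not have to be removed before parsing. A short sketch, assuming every date follows the `Mon D, YYYY` pattern seen in the table (the three dates below are copied from it):

```python
import pandas as pd

dates = pd.Series(["Dec 21, 1937", "Feb 9, 1940", "Nov 23, 2016"])

# An explicit format describes the comma, so no string cleanup is needed.
parsed = pd.to_datetime(dates, format="%b %d, %Y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Giving the format explicitly also makes pandas raise an error on a date that does not match, rather than guessing.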


First, I will group the movies by genre and take the maximum of each column.

In [80]:
movies_gross_grouped = movies_gross.groupby(['genre']).max().reset_index()

Then, let's check the info to see whether the numeric columns kept their types or were changed!

In [81]:
movies_gross_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   genre                     13 non-null     object        
 1   movie_title               13 non-null     object        
 2   release_date              13 non-null     datetime64[ns]
 3   MPAA_rating               13 non-null     object        
 4   total_gross               13 non-null     float64       
 5   inflation_adjusted_gross  13 non-null     object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 752.0+ bytes


In [82]:
movies_gross_grouped

Unnamed: 0,genre,movie_title,release_date,MPAA_rating,total_gross,inflation_adjusted_gross
0,Action,Tron,2016-05-06,R,623279547.0,"$96,971,361"
1,Adventure,Zootopia,2016-12-16,R,936662225.0,"$95,208,344"
2,Black Comedy,The Royal Tenenbaums,2001-12-14,R,52353636.0,"$76,758,193"
3,Comedy,You Again,2014-10-10,R,244082982.0,"$98,067,733"
4,Concert/Performance,Jonas Brothers: The 3D Concert Experi…,2009-02-27,G,65281781.0,"$76,646,993"
5,Documentary,X Games 3D: The Movie,2016-04-29,PG,32011576.0,"$86,264"
6,Drama,crazy/beautiful,2016-09-23,R,201151353.0,"$97,356,578"
7,Horror,The Puppet Masters,2011-08-19,R,26570463.0,"$9,907,922"
8,Musical,Tim Burton's The Nightmare Before Chr…,2014-12-25,PG-13,218951625.0,"$94,852,354"
9,,The War at Home,2002-01-01,R,35841901.0,"$9,156,084"
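One caveat when reading this table: `.max()` works column by column, so the **movie_title** in a row is the alphabetically last title of that genre, not necessarily the movie that earned the listed **total_gross**. A sketch of two alternatives, the summed gross per genre and the highest-grossing movie per genre, using the first five rows of the dataset as sample data (the variable names are my own):

```python
import pandas as pd

films = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
                    "Song of the South", "Cinderella"],
    "genre": ["Musical", "Adventure", "Musical", "Adventure", "Drama"],
    "total_gross": [184925485.0, 84300000.0, 83320000.0, 65000000.0, 85000000.0],
})

# Summed gross per genre, largest first.
genre_totals = films.groupby("genre")["total_gross"].sum().sort_values(ascending=False)

# Highest-grossing movie per genre: idxmax returns the row label of the maximum,
# so the title and the gross come from the same row.
top_rows = films.loc[films.groupby("genre")["total_gross"].idxmax()]

print(genre_totals)
print(top_rows)
```

On these five rows, a plain `.max()` would pair the title 'Song of the South' with Pinocchio's gross of 84,300,000 for the Adventure genre, while `idxmax` keeps the matching title.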


Now it is time to make graphs to visualize the data. To see which genre has the maximum total gross, we can run the code below:

In [83]:
max_gross = alt.Chart(movies_gross_grouped, width = 500, height = 300).mark_bar(color = 'purple').encode(
    x = alt.X('total_gross', title = 'Total gross'),
    y = alt.Y('genre', sort = '-x', title = 'Genre')
).properties(title = 'Which genre has the max gross?')

max_gross


As expected, `Adventure` has the highest maximum gross compared with the other genres.

Next, I am interested in finding out which genre grossed the most over the years.
To visualize this, we need a variable `decade` that shows how trends in Disney movies changed over the decades. To this end, I wrote a loop with conditions:

In [92]:
lis=[]
for i in range(len(movies_gross)):
    if movies_gross['release_date'][i].year > 2010:
        lis.append('2010-2020')
    elif movies_gross['release_date'][i].year <= 2010 and movies_gross['release_date'][i].year > 2000:
        lis.append('2000-2010')
    elif movies_gross['release_date'][i].year <= 2000 and movies_gross['release_date'][i].year > 1990:
        lis.append('1990-2000')
    elif movies_gross['release_date'][i].year <= 1990 and movies_gross['release_date'][i].year > 1980:
        lis.append('1980-1990')
    elif movies_gross['release_date'][i].year <= 1980 and movies_gross['release_date'][i].year > 1970:
        lis.append('1970-1980')
    elif movies_gross['release_date'][i].year <= 1970 and movies_gross['release_date'][i].year > 1960:
        lis.append('1960-1970')
    elif movies_gross['release_date'][i].year <= 1960 and movies_gross['release_date'][i].year > 1950:
        lis.append('1950-1960')
    else:
        lis.append('<1950')
movies_gross['decade'] = lis
movies_gross

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross,decade
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485.0,"$5,228,953,251",<1950
1,Pinocchio,1940-02-09,Adventure,G,84300000.0,"$2,188,229,052",<1950
2,Fantasia,1940-11-13,Musical,G,83320000.0,"$2,187,090,808",<1950
3,Song of the South,1946-11-12,Adventure,G,65000000.0,"$1,078,510,579",<1950
4,Cinderella,1950-02-15,Drama,G,85000000.0,"$920,608,730",<1950
...,...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,PG-13,12545979.0,"$12,545,979",2010-2020
575,Queen of Katwe,2016-09-23,Drama,PG,8874389.0,"$8,874,389",2010-2020
576,Doctor Strange,2016-11-04,Adventure,PG-13,232532923.0,"$232,532,923",2010-2020
577,Moana,2016-11-23,Adventure,PG,246082029.0,"$246,082,029",2010-2020
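The loop can also be written without explicit conditions. A minimal sketch with `pd.cut`, assuming the years have already been extracted (for the real frame that would be `movies_gross['release_date'].dt.year`); the sample years are chosen to hit the bin edges:

```python
import pandas as pd

years = pd.Series([1937, 1950, 1951, 1990, 2000, 2001, 2010, 2016])

edges = [float("-inf"), 1950, 1960, 1970, 1980, 1990, 2000, 2010, float("inf")]
labels = ["<1950", "1950-1960", "1960-1970", "1970-1980",
          "1980-1990", "1990-2000", "2000-2010", "2010-2020"]

# By default each bin includes its upper edge, matching the `<=` tests in the loop.
decade = pd.cut(years, bins=edges, labels=labels).astype(str)

print(decade.tolist())
```

As in the loop, a movie released in 1950 (Cinderella, for example) lands in the `<1950` group, because each bin is closed on its upper edge.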


In [94]:
decade_gross = alt.Chart(movies_gross).mark_circle().encode(
    x = alt.X('release_date', title = 'Release date'),
    y = alt.Y('total_gross', title = 'Total gross'),
    color = alt.Color('genre')
).properties(title = 'Which genre grossed more over the years?')

decade_gross



# Discussions

In this work, I analyzed the `disney_movies_total_gross.csv` dataset to find which genre had the highest gross. Before answering this question, I did some exploratory data analysis to check that the data types met the requirements of the analysis. Then, I grouped the movies by genre to find the maximum total gross in each group (genre). As expected, **Adventure** had the highest value, which agrees with the graph I plotted. The maximum total gross of a movie in the Adventure genre is $936,662,225.

The next question I tried to answer was *Which genre grossed more over the years?* To find out, I first wrote a loop that assigns each release date to a decade. Then, I plotted the total gross of each movie against its release date, colored by genre. The first movie produced by Disney was a musical, the classic 'Snow White and the Seven Dwarfs'.
We can see that the number of movies released increases over the years.
The total gross of movies also increases over the years.
Another interesting insight is that the popularity of genres changed over time, so the decade variable can be used for further analysis.
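
One possible way to use the decade variable is to sum the gross per decade and genre and then pick the leading genre in each decade. The sketch below uses five rows copied from the tables above; the names `by_decade` and `top_genre` are my own:

```python
import pandas as pd

sample = pd.DataFrame({
    "genre": ["Musical", "Adventure", "Adventure", "Drama", "Adventure"],
    "decade": ["<1950", "<1950", "2010-2020", "2010-2020", "2010-2020"],
    "total_gross": [184925485.0, 84300000.0, 232532923.0, 12545979.0, 246082029.0],
})

# One row per decade, one column per genre, cells hold the summed gross.
by_decade = sample.pivot_table(index="decade", columns="genre",
                               values="total_gross", aggfunc="sum", fill_value=0)

# Genre with the largest summed gross in each decade.
top_genre = by_decade.idxmax(axis=1)

print(by_decade)
print(top_genre)
```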

Other questions that could be explored with this dataset are *Which MPAA rating grossed more over the years?* and *Which MPAA rating grossed more after inflation over the years?*
We could check whether the MPAA ratings of early Disney movies follow the same trends as those of recent ones.
The impact of inflation on the gross income of movies, and its relation to their rating, would also be interesting.