# Final Project: Disney Movies Data Analysis: Disney Movies and Box Office Success
## Author: Nuray CAN
## Submission Date: Nov 6, Monday, 2023

# Introduction

## Question of interest
This analysis addresses one question about the Disney datasets: **"What is the most preferred movie genre among audiences?"** To answer it, I compare the success of movies across genres by examining the average inflation-adjusted gross for each genre in the dataset. Using financial performance as the measure of success, I hope to identify patterns and variations across genres.

## Dataset description 

In this notebook, we will analyze a dataset of Disney movies to explore what makes them successful. The notebook presents exploratory data analysis of the `Disney` dataset located [here](https://data.world/kgarrett/disney-character-success-00-16).

The Disney dataset includes the following tables: `disney-characters.csv`, `disney-director.csv`, `disney-voice-actors.csv`, `disney_revenue_1991-2016.csv`, and `disney_movies_total_gross.csv`. Each table is stored in a `.csv` file and contains different information about Disney movies, including genres, revenue, directors, characters, and financial information. I will be using the `disney_movies_total_gross` table, described below:

* **disney_movies_total_gross.csv**
    * This file contains information about Disney movie box office gross and inflation adjustments. It also includes the movie title, release date, genre, and Motion Picture Association film rating (G: General audiences, all ages admitted; PG: Parental guidance suggested; PG-13: Parents strongly cautioned; R: Restricted, under 17 requires accompanying parent or adult guardian).

## 1. Import Required Libraries

In [26]:
# Let's import all the libraries needed for this analysis
import pandas as pd # data processing
import numpy as np # linear algebra
import altair as alt # data visualization

## 2. Load Related Dataset

In [27]:
# Load the Disney movie total gross dataset
disney_movie_gross = pd.read_csv(
    'data/disney_movies_total_gross.csv',
    parse_dates=['release_date'],  # convert release_date to datetime
)
disney_movie_gross.head() # Shows first 5 rows

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,1940-02-09,Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,1940-11-13,Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,1946-11-12,Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,1950-02-15,Drama,G,"$85,000,000","$920,608,730"
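To illustrate what `parse_dates` does in the loading step above, here is a minimal, self-contained sketch. It does not read the project's CSV file; instead it embeds two rows from the preview as text (the variable names `csv_text`, `preview`, and `release_years` are made up for this example).

```python
import io

import pandas as pd

# Two rows copied from the preview above, embedded as text so the sketch
# runs without the data file
csv_text = (
    "movie_title,release_date\n"
    "Pinocchio,1940-02-09\n"
    "Fantasia,1940-11-13\n"
)
preview = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

# With parse_dates, the column is datetime-typed, so the .dt accessor works
release_years = preview["release_date"].dt.year.tolist()
print(release_years)
```

Because the column is parsed as a datetime, date parts such as the year can be extracted later without any extra conversion.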


## 3. Get more info about table

In [28]:
#Let's get more info about table
disney_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               579 non-null    object        
 1   release_date              579 non-null    datetime64[ns]
 2   genre                     562 non-null    object        
 3   MPAA_rating               523 non-null    object        
 4   total_gross               579 non-null    object        
 5   inflation_adjusted_gross  579 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 27.3+ KB


In [29]:
#This code retrieves the dimensions (number of rows and columns) 
disney_movie_gross.shape

(579, 6)

## 4. Convert datatypes to proper datatypes 

In [30]:
# Convert the 'total_gross' column from string to numeric
disney_movie_gross['total_gross'] = disney_movie_gross['total_gross'].str.replace(r'[\$,]', '', regex=True).astype(float)
# Now, the 'total_gross' column is numeric

In [31]:
type_total_gross=disney_movie_gross['total_gross'].dtype
print(type_total_gross)

float64


In [32]:
# Convert the 'inflation_adjusted_gross' column from string to numeric
disney_movie_gross['inflation_adjusted_gross'] = disney_movie_gross['inflation_adjusted_gross'].str.replace(r'[\$,]', '', regex=True).astype(float)
# Now, the 'inflation_adjusted_gross' column is numeric

In [33]:
type_inf_adjusted_gross=disney_movie_gross['inflation_adjusted_gross'].dtype
print(type_inf_adjusted_gross)

float64
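The two conversions above follow the same pattern, so they could be factored into a small helper. The sketch below is illustrative only: `currency_to_float` is a hypothetical name, not part of this notebook or of pandas, and the sample values are copied from the table preview.

```python
import pandas as pd


def currency_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings and convert to float.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return series.str.replace(r"[\$,]", "", regex=True).astype(float)


# Sample values taken from the first rows of the dataset preview
sample = pd.Series(["$184,925,485", "$84,300,000"])
converted = currency_to_float(sample)
print(converted.tolist())
```

Using a raw string (`r"..."`) for the pattern keeps the backslash in `\$` from being treated as a Python string escape.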


In [34]:
#print first 5 row to see what the tables look like.
disney_movie_gross.head()

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485.0,5228953000.0
1,Pinocchio,1940-02-09,Adventure,G,84300000.0,2188229000.0
2,Fantasia,1940-11-13,Musical,G,83320000.0,2187091000.0
3,Song of the South,1946-11-12,Adventure,G,65000000.0,1078511000.0
4,Cinderella,1950-02-15,Drama,G,85000000.0,920608700.0


In [35]:
# check again the data types
disney_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               579 non-null    object        
 1   release_date              579 non-null    datetime64[ns]
 2   genre                     562 non-null    object        
 3   MPAA_rating               523 non-null    object        
 4   total_gross               579 non-null    float64       
 5   inflation_adjusted_gross  579 non-null    float64       
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 27.3+ KB


The **disney_movie_gross** table has $579$ rows and $6$ columns: **title**, **release date**, **genre**, **MPAA rating**, **total gross**, and **inflation-adjusted gross**. The **genre** and **MPAA rating** columns contain missing values ($562$ and $523$ non-null entries, respectively); every other column is complete.

## 4. Assess Data Cleanliness: Determine its Tidiness

In [36]:
#This code counts and returns the number of missing values (NaN) in each column
disney_movie_gross.isnull().sum()

movie_title                  0
release_date                 0
genre                       17
MPAA_rating                 56
total_gross                  0
inflation_adjusted_gross     0
dtype: int64

In [37]:
# Drop rows with missing values in the 'genre' column.
# .copy() guards against a possible SettingWithCopyWarning on the assignment below.
disney_movie_gross = disney_movie_gross.dropna(subset=['genre']).copy()
# Fill missing values in the 'MPAA_rating' column with 'Not Rated'.
disney_movie_gross['MPAA_rating'] = disney_movie_gross['MPAA_rating'].fillna('Not Rated')

In [38]:
#Let's check again the number of missing values (NaN) in each column
disney_movie_gross.isnull().sum()

movie_title                 0
release_date                0
genre                       0
MPAA_rating                 0
total_gross                 0
inflation_adjusted_gross    0
dtype: int64
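The two cleaning choices above (dropping rows without a genre, labeling missing ratings) can be seen on a tiny example. This is a sketch on made-up data; the frame `toy` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows: one is missing a genre, one is missing a rating
toy = pd.DataFrame(
    {
        "movie_title": ["A", "B", "C"],
        "genre": ["Musical", None, "Comedy"],
        "MPAA_rating": ["G", "PG", None],
    }
)

# Rows without a genre cannot be grouped by genre, so they are dropped
cleaned = toy.dropna(subset=["genre"]).copy()
# A missing rating is kept as its own category instead of dropping the row
cleaned["MPAA_rating"] = cleaned["MPAA_rating"].fillna("Not Rated")

print(cleaned)
```

Dropping is applied only to `genre` because the analysis groups by that column; filling `MPAA_rating` keeps those rows available for the genre averages.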

In [39]:
#summarize the statistical values about our dataset to get more info about dataset
disney_movie_gross.describe()

Unnamed: 0,total_gross,inflation_adjusted_gross
count,562.0,562.0
mean,66448910.0,121700900.0
std,93842270.0,289813100.0
min,0.0,0.0
25%,14176480.0,24479280.0
50%,32014740.0,56278630.0
75%,77927660.0,122410800.0
max,936662200.0,5228953000.0


## 5. Find Top 15 movies
I will sort Disney movies by their inflation-adjusted gross to identify the top 15 highest earners at the box office.

In [40]:
disney_movie_gross = disney_movie_gross.sort_values('inflation_adjusted_gross', ascending=False)

# Add a 'release_year' column in case it becomes necessary for future use.
disney_movie_gross['release_year'] = disney_movie_gross['release_date'].dt.year

# Display the top 15 movies
top_15_movies = disney_movie_gross[:15]
top_15_movies

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross,release_year
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485.0,5228953000.0,1937
1,Pinocchio,1940-02-09,Adventure,G,84300000.0,2188229000.0,1940
2,Fantasia,1940-11-13,Musical,G,83320000.0,2187091000.0,1940
8,101 Dalmatians,1961-01-25,Comedy,G,153000000.0,1362871000.0,1961
6,Lady and the Tramp,1955-06-22,Drama,G,93600000.0,1236036000.0,1955
3,Song of the South,1946-11-12,Adventure,G,65000000.0,1078511000.0,1946
564,Star Wars Ep. VII: The Force Awakens,2015-12-18,Adventure,PG-13,936662225.0,936662200.0,2015
4,Cinderella,1950-02-15,Drama,G,85000000.0,920608700.0,1950
13,The Jungle Book,1967-10-18,Musical,Not Rated,141843000.0,789612300.0,1967
179,The Lion King,1994-06-15,Adventure,G,422780140.0,761640900.0,1994
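As an alternative to sorting the whole table and slicing, pandas also offers `DataFrame.nlargest`, which selects the top rows by a column directly. The sketch below uses four titles and values copied from the output above; the names `toy` and `top_2` are made up for this example.

```python
import pandas as pd

# Four rows copied from the table above (inflation-adjusted gross)
toy = pd.DataFrame(
    {
        "movie_title": [
            "Cinderella",
            "Pinocchio",
            "Snow White and the Seven Dwarfs",
            "Fantasia",
        ],
        "inflation_adjusted_gross": [
            920608700.0,
            2188229000.0,
            5228953000.0,
            2187091000.0,
        ],
    }
)

# Select the two highest earners without sorting the full frame first
top_2 = toy.nlargest(2, "inflation_adjusted_gross")
print(top_2["movie_title"].tolist())
```

In this notebook the full sort is still useful, since the sorted frame is reused afterwards; `nlargest` is simply a shorter option when only the top rows are needed.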


In [41]:
# now plot it using altair
top_15_movies_plot = (
    alt.Chart(top_15_movies, width=500, height=300)
    .mark_circle()
    .encode(
        x=alt.X("movie_title:O", sort="-y", title="Movie name"),
        y=alt.Y("inflation_adjusted_gross:Q", title="Inflation Adjusted Gross Box Office Earnings"),
    )
    .properties(title="Figure 1: Top 15 movies")
)
top_15_movies_plot

## 6. Analyze the movie success regarding movie genres
According to Figure 1 and the table above it, certain genres, such as Adventure and Musical, appear more often than others among the top earners. Therefore, we will analyze which genres earn the highest average inflation-adjusted gross.

### 6a. Import my function and format it 

In [42]:
# import the custom script to calculate average inflation adjusted gross for each genre
from myfunction import calculate_average_inflation_adjusted_gross

!flake8 myfunction.py

myfunction.py:2:24: W291 trailing whitespace
myfunction.py:12:80: E501 line too long (102 > 79 characters)
myfunction.py:17:80: E501 line too long (100 > 79 characters)
myfunction.py:27:80: E501 line too long (85 > 79 characters)
myfunction.py:28:80: E501 line too long (81 > 79 characters)
myfunction.py:51:80: E501 line too long (85 > 79 characters)
myfunction.py:87:80: E501 line too long (106 > 79 characters)
myfunction.py:91:80: E501 line too long (109 > 79 characters)
myfunction.py:101:80: E501 line too long (82 > 79 characters)
myfunction.py:108:80: E501 line too long (81 > 79 characters)


### 6b. Import my test function and format it 

In [43]:
!black myfunction.py #format myfunction

All done! ✨ 🍰 ✨
1 file left unchanged.


In [44]:
!flake8 test_myfunction.py

test_myfunction.py:2:24: W291 trailing whitespace
test_myfunction.py:6:74: W291 trailing whitespace
test_myfunction.py:60:80: E501 line too long (85 > 79 characters)
test_myfunction.py:62:80: E501 line too long (81 > 79 characters)
test_myfunction.py:68:80: E501 line too long (96 > 79 characters)
test_myfunction.py:72:1: E402 module level import not at top of file


In [45]:
!black test_myfunction.py

All done! ✨ 🍰 ✨
1 file left unchanged.


### 6c. Test my function  

In [46]:
# Run the tests in test_myfunction.py using pytest
!pytest test_myfunction.py

platform linux -- Python 3.8.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/prog-python-ds-students/release/final_project
plugins: anyio-3.2.1, dash-1.20.0
collected 2 items

test_myfunction.py ..                                                    [100%]



The test session results show that both tests collected from `test_myfunction.py` passed. This gives some assurance that the functions in `myfunction.py` behave as intended for the tested cases and are ready for use in this analysis.
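The contents of `myfunction.py` and `test_myfunction.py` are not shown in this notebook. As an illustration only, the sketch below shows what a test of a group-average function might look like. The function `average_by_group` is a hypothetical stand-in written for this example, and the toy data are made up; neither is the project's actual code.

```python
import pandas as pd


def average_by_group(df, group_col, value_col):
    """Hypothetical stand-in: mean of value_col for each value of group_col."""
    return df.groupby(group_col)[value_col].mean().reset_index()


def test_average_by_group():
    # Made-up data: two musicals and one comedy
    toy = pd.DataFrame(
        {
            "genre": ["Musical", "Musical", "Comedy"],
            "inflation_adjusted_gross": [100.0, 300.0, 50.0],
        }
    )
    result = average_by_group(toy, "genre", "inflation_adjusted_gross")
    expected = pd.DataFrame(
        {
            "genre": ["Comedy", "Musical"],
            "inflation_adjusted_gross": [50.0, 200.0],
        }
    )
    # Raises an AssertionError if the two frames differ
    pd.testing.assert_frame_equal(result, expected)


# pytest would discover and run the test automatically; here it is called directly
test_average_by_group()
```

A test like this checks the function on data small enough to verify by hand, which is what makes a passing result meaningful.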

### 6d. Calculate the mean inflation adjusted gross for each genre

In [47]:
# Group movies by genre and calculate the mean inflation adjusted gross for each genre
genre_success = calculate_average_inflation_adjusted_gross(disney_movie_gross, 'genre', 'inflation_adjusted_gross')
genre_success

Unnamed: 0,genre,inflation_adjusted_gross
0,Action,137473400.0
1,Adventure,190397400.0
2,Black Comedy,52243490.0
3,Comedy,84667730.0
4,Concert/Performance,57410840.0
5,Documentary,12718030.0
6,Drama,71893020.0
7,Horror,23413850.0
8,Musical,603597900.0
9,Romantic Comedy,77777080.0


### 6e. Visualize the movie success regarding movie genres

In [48]:
# Visualize the average inflation-adjusted gross of each movie genre using a bar plot.
chart_genre_success = (
    alt.Chart(genre_success, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X('genre:N', title='Movie Genre', sort='y'),
        y=alt.Y('inflation_adjusted_gross:Q', title='Average Inflation Adjusted Gross'),
    )
    .properties(title='Figure 2: Exploring Movie Success Across Genres: Inflation Adjusted Gross Analysis')
)
chart_genre_success

### 6f. Analyzing the Factors Behind the Elevated Success of Musicals:
  
Figure 2 shows Disney movies grouped by genre and their average inflation-adjusted gross earnings. Among the genres listed above, Disney's musicals have the highest average earnings by a wide margin (about 604 million dollars), followed by adventure (about 190 million) and action (about 137 million). Genres such as horror and documentary have much lower averages. These results give a first view of how financial performance differs across genres.
 
The high average for Disney's musicals may be attributed to several factors. Musicals often appeal to a broad audience across ages and cultures. They typically feature catchy songs, memorable characters, and engaging storylines, which may make them more likely to be rewatched and purchased. Disney has a long history of musical films; in this dataset, titles labeled as musicals include **"Snow White and the Seven Dwarfs"**, **"Fantasia"**, and **"The Jungle Book"**, all of which appear among the top earners above. (Note that the dataset classifies **"The Lion King"** as Adventure, not Musical.) Musicals may also generate additional revenue through the sale of soundtracks and merchandise, although that revenue is not captured in the box office figures analyzed here.

## 7. Check the distribution of movie genres

The distribution of movie genres complements the financial data in Figure 2 because it shows how many movies stand behind each genre's average. By examining how common each genre is in the dataset, we can judge whether a genre's high average, such as that of musicals, rests on many movies or on only a few.

For instance, if we see a relatively small number of musicals in the distribution compared to other genres but observe that they have high earnings in the financial data, it suggests that musicals, while not produced as frequently, have a strong financial impact when they are released. This is a valuable insight that connects the distribution of genres to the financial performance of Disney movies, helping us understand why certain genres outperform others in terms of earnings.

In [49]:
# Import the custom function that counts the movies in each genre
from myfunction import calculate_genre_counts

genre_count = calculate_genre_counts(disney_movie_gross,'genre')
print(genre_count)

                  genre  counts
0                Action      40
1             Adventure     129
2          Black Comedy       3
3                Comedy     182
4   Concert/Performance       2
5           Documentary      16
6                 Drama     114
7                Horror       6
8               Musical      16
9       Romantic Comedy      23
10    Thriller/Suspense      24
11              Western       7
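The source of `calculate_genre_counts` is not shown here. Judging only from the output above (one row per genre with a `counts` column), a function of this kind could be sketched as follows. `count_by_group` is a hypothetical stand-in and the toy data are made up; the project's actual implementation may differ.

```python
import pandas as pd


def count_by_group(df, group_col):
    """Hypothetical stand-in: number of rows for each value of group_col."""
    return df.groupby(group_col).size().reset_index(name="counts")


# Made-up data: two comedies, one musical, one western
toy = pd.DataFrame({"genre": ["Comedy", "Musical", "Comedy", "Western"]})
toy_counts = count_by_group(toy, "genre")
print(toy_counts)
```

`groupby(...).size()` counts rows per group, and `reset_index(name="counts")` turns the result into a two-column table like the one printed above.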


In [50]:
# Visualize the counts of movie genres 
chart_genre_count = (
    alt.Chart(genre_count, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X('genre:N', title='Movie Genre', sort='y'),
        y=alt.Y('counts:Q', title='Number of Movies'),
    )
    .properties(title='Figure 3: Number of Movies per Genre in the Disney Movies Dataset')
)
chart_genre_count

Based on Figure 2 and Figure 3, although there are far more comedy movies (182) than musicals (16), musicals generate a much higher average inflation-adjusted gross, which suggests that audiences prefer them. Preferences for certain genres can be influenced by various factors, including audience demographics, cultural trends, and marketing strategies. It is therefore important to consider these factors when analyzing why musicals might be preferred over comedies, even though many more comedies are available.

Comparing Figure 3 with Figure 2, we can see that the number of movies in a genre does not always track its financial success. For instance, while musicals have the highest average earnings in Figure 2, they account for only 16 movies in Figure 3. This suggests that musicals, though financially lucrative, are produced far less often than some other genres.
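The comparison between counts and averages can also be made in a single table by merging the two results on `genre`. The sketch below uses values for three genres copied from the outputs above; the names `avg_gross`, `counts`, and `comparison` are made up for this example.

```python
import pandas as pd

# Values copied from the genre_success and genre_count outputs (three genres only)
avg_gross = pd.DataFrame(
    {
        "genre": ["Adventure", "Comedy", "Musical"],
        "inflation_adjusted_gross": [190397400.0, 84667730.0, 603597900.0],
    }
)
counts = pd.DataFrame(
    {
        "genre": ["Adventure", "Comedy", "Musical"],
        "counts": [129, 182, 16],
    }
)

# Put average gross and movie count side by side, highest average first
comparison = avg_gross.merge(counts, on="genre").sort_values(
    "inflation_adjusted_gross", ascending=False
)
print(comparison)
```

In the notebook, the same merge could presumably be applied to `genre_success` and `genre_count` directly, assuming both keep the `genre` column shown in their outputs.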

# Discussions

In this analysis of Disney movie data, we explored audience preferences across genres using box office gross earnings. The analysis showed that Disney's musical genre is a financial powerhouse, with the highest average gross earnings when adjusted for inflation. This success may be attributed to its broad appeal, characterized by catchy songs, memorable characters, and compelling storylines that resonate with a diverse audience. In addition, the profitability of the genre may be boosted by the sale of soundtracks and merchandise. However, the number of movies in a genre does not necessarily dictate its financial success. For example, despite being high earners, musicals account for relatively few movies (16 of the 562 analyzed).

Our findings align with the expectation that musicals, given their broad appeal and cultural significance, would be successful. However, it is interesting to note that success is not solely determined by the number of movies produced in a genre. While this analysis has shed light on the financial success of Disney movie genres, several questions remain unanswered. For instance, it would be interesting to explore the influence of specific movie directors, release dates, and MPAA ratings on earnings.

To better understand what drives the success of specific movie genres, further analysis of audience demographics, cultural trends, and marketing strategies is needed. Such analysis can provide valuable insights for Disney and the entertainment industry, offering a better understanding of why certain genres outperform others in terms of earnings and popularity. It can inform future decision-making and content creation strategies for Disney and other film studios.

# References

## Resources used
* [Data Source](https://data.world/kgarrett/disney-character-success-00-16)
    * The data were obtained from data.world, which follows a Creative Commons Attribution 4.0 International License
    * Disney Character Success dataset is from [Kelly Garrett](https://data.world/kgarrett)

* [Understanding inflation adjustment](https://help.imdb.com/article/imdbpro/industry-research/box-office-mojo-by-imdbpro-faq/GCWTV4MQKGWRAUAP?ref_=mojo_cso_md#inflation)
    * How are grosses adjusted for ticket price inflation?
    
* [Understanding the history of Disney](https://d23.com/disney-history/)
    * Disney History

