<a id="intro"></a>

# 1. Introduction

![DISNEY](disney.jfif)

Walt Disney and Roy O. Disney founded Disney in 1923, and it is now one of the world's leading entertainment companies. Disney is well known for its animated films, such as Snow White and the Seven Dwarfs (1937), and has since expanded into live-action films, television, theme parks, and media networks. Over time, Disney acquired other major studios such as Pixar, Marvel, Lucasfilm, and 20th Century Fox, considerably extending its reach. The Disney brand is known for its family-friendly content, imaginative storytelling, and worldwide cultural impact.

This notebook analyzes the Disney movies dataset to uncover trends and patterns in box office performance (US & Canada), genres, release dates, and ratings across Disney's movie portfolio from 1937 to 2016.

1. [Introduction](#intro)
2. [Data Acquisition](#dataq)
3. [Data Cleaning](#datac) 
4. [Exploratory Data Analysis](#eda)

<a id="dataq"></a>

# 2. Data Acquisition
## Importing Libraries
The libraries needed for the analysis are imported here. Additional libraries are imported later if required.

In [125]:
# Import the necessary libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [126]:
# Read in the data
disney = pd.read_csv("disney_movies_total_gross.csv")

# Preview the data
disney.head()

| | movie_title | release_date | genre | mpaa_rating | total_gross | inflation_adjusted_gross |
|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | 1937-12-21 | Musical | G | 184925485 | 5228953251 |
| 1 | Pinocchio | 1940-02-09 | Adventure | G | 84300000 | 2188229052 |
| 2 | Fantasia | 1940-11-13 | Musical | G | 83320000 | 2187090808 |
| 3 | Song of the South | 1946-11-12 | Adventure | G | 65000000 | 1078510579 |
| 4 | Cinderella | 1950-02-15 | Drama | G | 85000000 | 920608730 |


In [127]:
# Checking the number of rows and columns
disney.shape

(579, 6)

In [128]:
# Display the information of the dataset
disney.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   movie_title               579 non-null    object
 1   release_date              579 non-null    object
 2   genre                     562 non-null    object
 3   mpaa_rating               523 non-null    object
 4   total_gross               579 non-null    int64 
 5   inflation_adjusted_gross  579 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 27.3+ KB


In [145]:
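# Display the statistics of the data
# Note: this cell was executed after the cleaning and year-extraction steps below,
# so the output includes release_date as datetime and the derived year column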
disney.describe().T

| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| release_date | 579.0 | 1998-07-10 13:20:49.740932608 | 1937-12-21 00:00:00 | 1993-03-29 12:00:00 | 1998-09-04 00:00:00 | 2006-02-06 12:00:00 | 2016-12-16 00:00:00 | |
| total_gross | 579.0 | 64701788.519862 | 0.0 | 12788864.0 | 30702446.0 | 75709033.0 | 936662225.0 | 93013006.117427 |
| inflation_adjusted_gross | 579.0 | 118762523.310881 | 0.0 | 22741232.0 | 55159783.0 | 119202000.0 | 5228953251.0 | 286085280.041417 |
| year | 579.0 | 1998.01209 | 1937.0 | 1993.0 | 1998.0 | 2006.0 | 2016.0 | 11.410924 |


<a id="datac"></a>

# 3. Data Cleaning
Missing values in `genre` and `mpaa_rating` are replaced with placeholder labels, and `release_date` is converted to a datetime type.

In [130]:
# Check for null values
disney.isna().sum()

movie_title                  0
release_date                 0
genre                       17
mpaa_rating                 56
total_gross                  0
inflation_adjusted_gross     0
dtype: int64

In [131]:
# Fill null values: 'Not Rated' for mpaa_rating and the string 'NaN' for genre
disney['mpaa_rating'] = disney['mpaa_rating'].fillna("Not Rated")
disney['genre'] = disney['genre'].fillna("NaN")

In [132]:
# Convert data type of release_date to datetime
disney['release_date'] = pd.to_datetime(disney['release_date'])
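
Before moving on, it can be worth confirming that the cleaning steps behaved as intended. The following is a minimal sketch on a small hand-made frame (the third row and its malformed date are illustrative assumptions, not taken from the dataset). It uses `errors="coerce"` so that unparseable dates become `NaT` and can be counted instead of raising an error.

```python
import pandas as pd

# Toy stand-in for the real data; the third row is hypothetical and its
# date is deliberately malformed.
sample = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio", "Hypothetical Movie"],
    "release_date": ["1937-12-21", "1940-02-09", "not a date"],
    "genre": ["Musical", "Adventure", None],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
sample["release_date"] = pd.to_datetime(sample["release_date"], errors="coerce")
unparsed_dates = int(sample["release_date"].isna().sum())

# Same placeholder approach as the cleaning cell: fill missing genres with a label
sample["genre"] = sample["genre"].fillna("NaN")
remaining_nulls = int(sample["genre"].isna().sum())

print("Unparseable dates:", unparsed_dates)
print("Remaining null genres:", remaining_nulls)
```

On the real `disney` frame, a non-zero count of unparseable dates would be a signal to inspect those rows before extracting the year.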

<a id="eda"></a>

# 4. Exploratory Data Analysis
Data visualization techniques are used to explore and understand various aspects of Disney's movie portfolio.

In [150]:
# Find total movies by genre
genre_disney = disney['genre'].value_counts(sort=True) 

fig = px.bar(genre_disney, title="Total Movies By Genre")

# Update axis titles
fig.update_layout(
    xaxis_title="Genre",
    yaxis_title="Number of Movies",
    showlegend=False
)

# Change the color of the first bar
colors = ['royalblue'] + ['lightblue'] * (len(genre_disney) - 1)  
fig.update_traces(marker_color=colors)

fig.show()

Comedy is the most common genre, with 182 movies, followed by adventure and drama.

In [134]:
# Find total movies by mpaa rating
rating_disney = disney['mpaa_rating'].value_counts(sort=True)

fig = px.bar(rating_disney, title="Total Movies By MPAA rating")

# Update axis titles
fig.update_layout(
    xaxis_title="MPAA Rating",
    yaxis_title="Number of Movies",
    showlegend=False
)

# Change the color of the first bar
colors = ['royalblue'] + ['lightblue'] * (len(rating_disney) - 1)
fig.update_traces(marker_color=colors)

fig.show()

PG-rated Disney movies are the most common, followed by PG-13 and R-rated movies.
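
The highlight color list used in these bar charts should contain one entry per bar in the chart being drawn. A small helper makes that explicit. `highlight_colors` is a hypothetical function introduced here for illustration and is not part of the notebook.

```python
def highlight_colors(n_bars, highlight_index=0,
                     highlight="royalblue", base="lightblue"):
    """Return one color per bar, with a single bar highlighted."""
    if n_bars <= 0:
        return []
    colors = [base] * n_bars
    # Negative indices behave like list indexing, so -1 highlights the last bar
    colors[highlight_index] = highlight
    return colors


rating_colors = highlight_colors(5)        # first of 5 bars highlighted
bottom_colors = highlight_colors(10, -1)   # last of 10 bars highlighted
print(rating_colors)
```

With this helper, a chart could be colored with, for example, `fig.update_traces(marker_color=highlight_colors(len(rating_disney)))`, so the list length always follows the data being plotted.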

In [135]:
# Extract the year from the datetime column
disney['year'] = disney['release_date'].dt.year  

# Find the number of movies released per year
movies_per_year = disney['year'].value_counts().sort_index()  # Sort index to have years in ascending order

In [136]:
fig = px.line(movies_per_year, title='Total Movies Released Over the Years')

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Movies',
    hovermode='x unified',  # This makes the hover tooltip appear for all points on the x-axis
    showlegend=False
)

fig.show()

1995 was the peak year for Disney movie releases, with 32 movies.
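
One caveat with `value_counts()` is that years without any release are absent from the result, so a line chart draws a straight segment across the gap. The data preview suggests such gaps exist in the early years (for example between 1940 and 1946, assuming the rows are ordered by release date). The sketch below, on toy values, shows one way to fill missing years with zero using `reindex`.

```python
import pandas as pd

# Toy release years with a gap (no releases in 1938 or 1939)
years = pd.Series([1937, 1940, 1940, 1941], name="year")

counts = years.value_counts().sort_index()

# Reindex over the full year range so years without releases appear as 0
full_range = range(counts.index.min(), counts.index.max() + 1)
counts_filled = counts.reindex(full_range, fill_value=0)

print(counts_filled)
```

Whether to fill gaps is a presentation choice. Filling makes quiet years visible, while the unfilled series only shows years that appear in the data.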

In [137]:
# Sort movies by total_gross in descending order
sorted_movies = disney.sort_values(by='total_gross', ascending=False)

top_10_movies = sorted_movies.head(10) # Subset only top 10 movies

fig = px.bar(top_10_movies, x='movie_title', y='total_gross', title="Top 10 Highest Grossing Movies")

# Update axis titles
fig.update_layout(
    xaxis_title="Movie",
    yaxis_title="Total Gross",
    showlegend=False
)

# Change the color of the first bar
colors = ['royalblue'] + ['lightblue'] * (len(top_10_movies) - 1)
fig.update_traces(marker_color=colors)

fig.show()

"Star Wars Episode VII: The Force Awakens" achieved the highest total gross among all Disney movies, with a remarkable $936.7 million in revenue.

In [138]:
bottom_10_movies = sorted_movies.tail(10) # Subset only bottom 10 movies

fig = px.bar(bottom_10_movies, x='movie_title', y='total_gross', title="10 Lowest-Grossing Disney Movies")

# Update axis titles
fig.update_layout(
    xaxis_title="Movie",
    yaxis_title="Total Gross",
    showlegend=False
)

# Change the color of the last bar (the lowest-grossing movie)
colors = ['lightblue'] * (len(bottom_10_movies) - 1) + ['royalblue']
fig.update_traces(marker_color=colors)

fig.show()

The four lowest-ranked movies have no usable total gross in the dataset (the summary statistics show a minimum of 0). This is likely because they did not perform notably at the box office and their totals are not widely documented.
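
Since the summary statistics show a minimum `total_gross` of 0, one option is to count such entries and set them aside before ranking the lowest-grossing movies. The following is a minimal sketch on toy data. The titles and values are made up for illustration, and treating 0 as "not reported" is an assumption rather than something the dataset documents.

```python
import pandas as pd

# Toy frame; titles and values are illustrative only
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "total_gross": [0, 0, 21000, 5000000],
})

# Count entries recorded as zero (assumed here to mean "not reported")
n_zero = int((movies["total_gross"] == 0).sum())

# Rank only the movies that have a reported gross
reported = movies[movies["total_gross"] > 0]
lowest = reported.sort_values("total_gross").head(2)

print("Zero-gross entries:", n_zero)
print(lowest)
```

Applied to the real data, this would show the lowest-grossing movies with a recorded figure rather than a chart dominated by empty bars.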

In [139]:
fig = px.scatter(disney, x='release_date', y='total_gross', color='genre', title='Total Gross of Disney Movies by Genre Over the Years')

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Gross',
    legend_title_text='Movie Genre'
)

fig.show() 

Both the number of movies produced by Disney and their total gross increase over time. Over the past two decades, the highest-grossing Disney movies have predominantly been in the adventure genre.
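
A scatter plot gives a visual impression of growth. Aggregating by decade is one way to check it numerically. The sketch below uses toy values (in millions, not taken from the dataset) and assumes a `year` column like the one created earlier. It reports the movie count and median gross per decade.

```python
import pandas as pd

# Toy data: release year and total gross in millions (illustrative values)
sample = pd.DataFrame({
    "year": [1937, 1940, 1995, 1995, 1998, 2015],
    "total_gross": [100, 80, 50, 90, 30, 900],
})

# Map each year to the start of its decade, e.g. 1995 -> 1990
sample["decade"] = sample["year"] // 10 * 10

# Count of movies and median gross per decade
by_decade = sample.groupby("decade")["total_gross"].agg(
    movies="count",
    median_gross="median",
)

print(by_decade)
```

The median is used here instead of the mean so that a single blockbuster does not dominate a decade's figure.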

In [140]:
fig = px.scatter(disney, x='release_date', y='inflation_adjusted_gross', color='genre', title='Total Gross of Disney Movies After Inflation by Genre Over the Years')

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Gross (After Inflation)',
    legend_title_text='Movie Genre'
)

fig.show() 

After adjusting for inflation, the gross of earlier Disney movies rises substantially, making "Snow White and the Seven Dwarfs" Disney's highest-grossing film in inflation-adjusted terms.
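
To make the size of the adjustment concrete, the ratio of inflation-adjusted gross to nominal gross can be computed for a few films. The figures below are copied from the data preview at the top of the notebook. The calculation itself is a simple illustration, not part of the original analysis.

```python
# (nominal gross, inflation-adjusted gross), taken from the data preview
films = {
    "Snow White and the Seven Dwarfs": (184925485, 5228953251),
    "Pinocchio": (84300000, 2188229052),
    "Cinderella": (85000000, 920608730),
}

# Ratio of adjusted to nominal gross for each film
ratios = {
    title: adjusted / nominal
    for title, (nominal, adjusted) in films.items()
}

for title, ratio in ratios.items():
    print(f"{title}: adjusted gross is about {ratio:.1f}x the nominal gross")
```

The multiplier shrinks for later releases (roughly 28x for the 1937 film versus roughly 11x for the 1950 film), which is why the oldest titles move to the top of the inflation-adjusted chart.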

In [141]:
fig = px.scatter(disney, x='release_date', y='total_gross', color='mpaa_rating', title='Total Gross of Disney Movies by MPAA Rating Over the Years')

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Gross',
    legend_title_text='MPAA Rating'
)

fig.show() 

As in the genre plot, both the number of movies and their total gross increase over time. Over the past two decades, the highest-grossing Disney movies have predominantly been rated PG or PG-13.

In [142]:
fig = px.scatter(disney, x='release_date', y='inflation_adjusted_gross', color='mpaa_rating', title='Total Gross of Disney Movies After Inflation by MPAA Rating Over the Years')

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Gross (After Inflation)',
    legend_title_text='MPAA Rating'
)

fig.show() 

After adjusting for inflation, the earlier movies, which are rated G, become the highest-grossing films.

In [143]:
# Read in the data
disney_imdb_rating = pd.read_csv("disney_last.csv")

# Select only the 'title' and 'imdb' columns from disney_imdb_rating
imdb_ratings = disney_imdb_rating[['title', 'imdb']]

# Merge both dataframes to get the IMDb ratings for the movies (only the available ones)
disney_rated = imdb_ratings.merge(disney, left_on='title', right_on='movie_title', how='inner')

# Preview the data
disney_rated.head()

| | title | imdb | movie_title | release_date | genre | mpaa_rating | total_gross | inflation_adjusted_gross | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | 7.6 | Snow White and the Seven Dwarfs | 1937-12-21 | Musical | G | 184925485 | 5228953251 | 1937 |
| 1 | Pinocchio | 7.4 | Pinocchio | 1940-02-09 | Adventure | G | 84300000 | 2188229052 | 1940 |
| 2 | Fantasia | 7.7 | Fantasia | 1940-11-13 | Musical | G | 83320000 | 2187090808 | 1940 |
| 3 | Song of the South | 7.1 | Song of the South | 1946-11-12 | Adventure | G | 65000000 | 1078510579 | 1946 |
| 4 | Cinderella | 6.9 | Cinderella | 1950-02-15 | Drama | G | 85000000 | 920608730 | 1950 |
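
An inner join silently drops movies whose titles do not match exactly, so it helps to know how many rows were lost. The sketch below uses two toy frames ("Hypothetical Film" and its gross are made up). It performs a left join with `indicator=True`, which adds a `_merge` column marking each row as matched or unmatched.

```python
import pandas as pd

# Toy versions of the two tables; the last gross row is hypothetical
gross = pd.DataFrame({
    "movie_title": ["Cinderella", "Pinocchio", "Hypothetical Film"],
    "total_gross": [85000000, 84300000, 1000000],
})
ratings = pd.DataFrame({
    "title": ["Cinderella", "Pinocchio"],
    "imdb": [6.9, 7.4],
})

# Left join from the gross table; indicator=True adds a _merge column
checked = gross.merge(ratings, left_on="movie_title", right_on="title",
                      how="left", indicator=True)

# Titles with no IMDb rating, and the share of movies that were matched
unmatched = checked.loc[checked["_merge"] == "left_only", "movie_title"].tolist()
match_rate = (checked["_merge"] == "both").mean()

print("Unmatched titles:", unmatched)
print("Match rate:", round(match_rate, 2))
```

Joining on title alone also means that films sharing a title (for example an original and a remake) could match more than once. If that is a concern, the `validate` argument of `merge` can be used to raise an error on unexpected duplicates.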


In [144]:
# Note: trendline="ols" requires the statsmodels package to be installed
fig = px.scatter(disney_rated, x='imdb', y='total_gross', trendline="ols", title='Total Gross of Disney Movies by IMDb Rating')

fig.update_layout(
    xaxis_title='IMDb Rating',
    yaxis_title='Total Gross',
)

fig.show() 

In [152]:
# Calculate correlation
correlation = disney_rated['imdb'].corr(disney_rated['total_gross'])
print("Correlation:", correlation)

Correlation: 0.4379567756734994


Based on the available data, there is a moderate positive correlation (about 0.44) between IMDb ratings and total gross, indicating that higher-rated Disney movies tend to earn more at the box office. This relationship is not guaranteed to hold in general: IMDb ratings do not always reflect box office performance, because the IMDb voting population may not represent the broader movie-going audience. In addition, the merged dataset does not contain all Disney movies, since IMDb ratings for some of them are unavailable or missing.
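
`Series.corr` computes the Pearson coefficient by default. As an illustration of what that number measures, the sketch below implements the Pearson formula directly on toy values (the ratings and grosses are made up) and also squares the coefficient reported above to get the share of variance explained by a linear fit.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data: IMDb-style ratings and gross in millions (illustrative only)
toy_ratings = [6.0, 6.5, 7.0, 7.5, 8.0]
toy_gross = [20, 35, 30, 60, 55]
r = pearson_r(toy_ratings, toy_gross)

# Coefficient reported by the notebook, squared to get r^2
notebook_r = 0.4379567756734994
r_squared = notebook_r ** 2

print(round(r, 3))
print(round(r_squared, 3))
```

An r of about 0.44 corresponds to an r² of about 0.19, meaning a linear fit on IMDb rating accounts for roughly a fifth of the variation in total gross in the merged data.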