# Exploratory data analysis of the Disney datasets

## Foreword

This notebook presents an exploratory data analysis of the `Disney` dataset located [here](https://data.world/kgarrett/disney-character-success-00-16). More information about the dataset is available on the course `canvas` page.

# Introduction

## Question(s) of interest
In this analysis, I will be investigating a question associated with the collection of Disney datasets.
I am interested in finding out which year has the highest total gross and the highest revenue. This is interesting because both figures depend on the movies released in a given year. It would also be interesting to see which of Disney's 4 segments brings in the most revenue, that is, which one makes up the largest part of the total revenue in a given year. I would expect the **Disney Media or the Disney Studio** segment to have the highest revenue in most years from 1991 to 2016.

## Dataset description 

The descriptions below were taken directly from the [website](https://data.world/kgarrett/disney-character-success-00-16) where the datasets were obtained.

Spanning over a decade, from the turn of the millennium to 2016, this dataset compiles data related to Disney movies, their attributes, and various factors that contribute to their success. These factors may include their appearances in movies, genre, gross income, revenue and much more. Exploring this dataset allows us to uncover trends, patterns, and insights into what makes a Disney movie iconic and enduring in the hearts of fans.

The Disney dataset is composed of $5$ tables, `disney-characters.csv`, `disney-director.csv`, `disney-voice-actors.csv`, `disney_revenue_1991-2016.csv`, `disney-movies-total_gross.csv`. Each table is stored in a `.csv` file and contains different information about Disney, including movie characters, directors, voice actors, total gross, and the revenue made per year. I will be using the `revenue` and `total_gross` tables, formally described below:

* **disney_revenue_1991-2016.csv**
    * This file contains Disney's annual revenue from 1991 to 2016. Each row holds a year, the revenue from each business segment (Studio Entertainment, Disney Consumer Products, Disney Interactive, Walt Disney Parks and Resorts, Disney Media Networks) and the total revenue for that year.
* **disney-movies-total_gross.csv**
    * This file includes information on Disney movies. Each movie has its index number, a movie title, release date, genre, its MPAA rating, total gross and the inflation adjusted gross.

# Methods and Results

Since I am only interested in computing the total gross and revenue per year, I will need to use tables that contain information on total gross and revenue. This means that I will need the **total_gross** and the **revenue** tables. I will still import the **characters** table for later use if needed.

However, before moving further, let us import the tables and do some basic visualizations.

In [1]:
# Let's import all the libraries required for this analysis
import pandas as pd
import altair as alt
import numpy as np
from add_new_entry import add_new_entry

# import all the required files
total_gross = pd.read_csv("data/disney_movies_total_gross.csv",parse_dates=['release_date'])
characters = pd.read_csv("data/disney-characters.csv",parse_dates=['release_date'])
revenue = pd.read_csv("data/disney_revenue_1991-2016.csv")

Let's see what the tables look like.

In [2]:
total_gross.head()

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,1940-02-09,Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,1940-11-13,Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,1946-11-12,Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,1950-02-15,Drama,G,"$85,000,000","$920,608,730"


In [3]:
characters.head()

Unnamed: 0,movie_title,release_date,hero,villian,song
0,\nSnow White and the Seven Dwarfs,1937-12-21,Snow White,Evil Queen,Some Day My Prince Will Come
1,\nPinocchio,1940-02-07,Pinocchio,Stromboli,When You Wish upon a Star
2,\nFantasia,1940-11-13,,Chernabog,
3,Dumbo,1941-10-23,Dumbo,Ringmaster,Baby Mine
4,\nBambi,1942-08-13,Bambi,Hunter,Love Is a Song
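
Several `movie_title` values in the **characters** preview above start with a newline (shown as `\n`). If this table were later matched against **total_gross** by title, those stray characters would prevent a match. The following is a minimal sketch of one way to strip them, using a small hand-made frame (an assumption for illustration) rather than the real file:

```python
import pandas as pd

# Small stand-in for the characters table; titles copied from the preview above
sample = pd.DataFrame(
    {"movie_title": ["\nSnow White and the Seven Dwarfs", "\nPinocchio", "Dumbo"]}
)

# str.strip() removes leading and trailing whitespace, including newlines
sample["movie_title"] = sample["movie_title"].str.strip()
print(sample["movie_title"].tolist())
```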


In [4]:
revenue.head()

Unnamed: 0,Year,Studio Entertainment[NI 1],Disney Consumer Products[NI 2],Disney Interactive[NI 3][Rev 1],Walt Disney Parks and Resorts,Disney Media Networks,Total
0,1991,2593.0,724.0,,2794.0,,6111
1,1992,3115.0,1081.0,,3306.0,,7502
2,1993,3673.4,1415.1,,3440.7,,8529
3,1994,4793.0,1798.2,,3463.6,359.0,10414
4,1995,6001.5,2150.0,,3959.8,414.0,12525


Let's get some other information about the **total gross** table.

In [5]:
total_gross.dtypes

movie_title                         object
release_date                datetime64[ns]
genre                               object
MPAA_rating                         object
total_gross                         object
inflation_adjusted_gross            object
dtype: object

In [6]:
total_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               579 non-null    object        
 1   release_date              579 non-null    datetime64[ns]
 2   genre                     562 non-null    object        
 3   MPAA_rating               523 non-null    object        
 4   total_gross               579 non-null    object        
 5   inflation_adjusted_gross  579 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 27.3+ KB


The **total_gross** table has $579$ rows and $6$ columns. Every **movie title** has a **release date**, a **total gross** and an **inflation adjusted gross**; the **genre** and the **MPAA rating** are missing for some movies.

First, let's get rid of the rows with null (NaN) values:

In [7]:
total_gross_cleaned = total_gross.dropna()
total_gross_cleaned

Unnamed: 0,movie_title,release_date,genre,MPAA_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,"$184,925,485","$5,228,953,251"
1,Pinocchio,1940-02-09,Adventure,G,"$84,300,000","$2,188,229,052"
2,Fantasia,1940-11-13,Musical,G,"$83,320,000","$2,187,090,808"
3,Song of the South,1946-11-12,Adventure,G,"$65,000,000","$1,078,510,579"
4,Cinderella,1950-02-15,Drama,G,"$85,000,000","$920,608,730"
...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,PG-13,"$12,545,979","$12,545,979"
575,Queen of Katwe,2016-09-23,Drama,PG,"$8,874,389","$8,874,389"
576,Doctor Strange,2016-11-04,Adventure,PG-13,"$232,532,923","$232,532,923"
577,Moana,2016-11-23,Adventure,PG,"$246,082,029","$246,082,029"


In [8]:
total_gross_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 513 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               513 non-null    object        
 1   release_date              513 non-null    datetime64[ns]
 2   genre                     513 non-null    object        
 3   MPAA_rating               513 non-null    object        
 4   total_gross               513 non-null    object        
 5   inflation_adjusted_gross  513 non-null    object        
dtypes: datetime64[ns](1), object(5)
memory usage: 28.1+ KB


Next, I remove the **$** signs and commas from the two gross columns and convert them to integers:

In [9]:
# Remove dollar signs and commas and convert to integers
total_gross_cleaned['total_gross'] = total_gross_cleaned['total_gross'].replace('[$,]', '', regex=True).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  total_gross_cleaned['total_gross'] = total_gross_cleaned['total_gross'].replace('[$,]', '', regex=True).astype(int)


In [10]:
# Remove dollar signs and commas and convert to integers
total_gross_cleaned['inflation_adjusted_gross'] = total_gross_cleaned['inflation_adjusted_gross'].replace('[$,]', '', regex=True).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  total_gross_cleaned['inflation_adjusted_gross'] = total_gross_cleaned['inflation_adjusted_gross'].replace('[$,]', '', regex=True).astype(int)
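
The `SettingWithCopyWarning` shown above appears because `dropna()` returned a frame that pandas still treats as possibly linked to the original. One common way to avoid it is to take an explicit `.copy()` before assigning new columns. The sketch below illustrates this on a tiny hand-made frame; the third row is made up purely to have something to drop, and the name `cleaned` is an assumption, not a variable from this notebook:

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio", "Made-up row"],
        "genre": ["Musical", "Adventure", None],
        "total_gross": ["$184,925,485", "$84,300,000", "$1,000"],
        "inflation_adjusted_gross": ["$5,228,953,251", "$2,188,229,052", "$2,000"],
    }
)

# .copy() makes the cleaned frame independent of `raw`,
# so the column assignments below do not trigger the warning
cleaned = raw.dropna().copy()

for col in ["total_gross", "inflation_adjusted_gross"]:
    # Strip "$" and "," and convert; int64 is requested explicitly because
    # the inflation-adjusted values exceed the 32-bit integer range
    cleaned[col] = cleaned[col].str.replace(r"[$,]", "", regex=True).astype("int64")

print(cleaned.dtypes)
```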


In [11]:
total_gross_cleaned.dtypes

movie_title                         object
release_date                datetime64[ns]
genre                               object
MPAA_rating                         object
total_gross                          int64
inflation_adjusted_gross             int64
dtype: object

In [12]:
total_gross_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 513 entries, 0 to 578
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   movie_title               513 non-null    object        
 1   release_date              513 non-null    datetime64[ns]
 2   genre                     513 non-null    object        
 3   MPAA_rating               513 non-null    object        
 4   total_gross               513 non-null    int64         
 5   inflation_adjusted_gross  513 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 28.1+ KB


In [13]:
total_gross_cleaned['movie_title'].str.title()

0      Snow White And The Seven Dwarfs
1                            Pinocchio
2                             Fantasia
3                    Song Of The South
4                           Cinderella
                    ...               
574           The Light Between Oceans
575                     Queen Of Katwe
576                     Doctor Strange
577                              Moana
578       Rogue One: A Star Wars Story
Name: movie_title, Length: 513, dtype: object

In [14]:
total_gross_cleaned['genre'].str.lower()

0        musical
1      adventure
2        musical
3      adventure
4          drama
         ...    
574        drama
575        drama
576    adventure
577    adventure
578    adventure
Name: genre, Length: 513, dtype: object

In [15]:
total_gross_cleaned['MPAA_rating'].str.upper()

0          G
1          G
2          G
3          G
4          G
       ...  
574    PG-13
575       PG
576    PG-13
577       PG
578    PG-13
Name: MPAA_rating, Length: 513, dtype: object

In [16]:
# Split 'MPAA_rating' into 'MPAA_ratings' and 'Rating' columns
try:
    total_gross_cleaned[['MPAA_ratings', 'Rating']] = total_gross_cleaned['MPAA_rating'].str.split('-', n=1, expand=True)
except ValueError:
    total_gross_cleaned['MPAA_ratings'] = total_gross_cleaned['MPAA_rating']
    total_gross_cleaned['Rating'] = np.nan
print(total_gross_cleaned)

                         movie_title release_date      genre MPAA_rating  \
0    Snow White and the Seven Dwarfs   1937-12-21    Musical           G   
1                          Pinocchio   1940-02-09  Adventure           G   
2                           Fantasia   1940-11-13    Musical           G   
3                  Song of the South   1946-11-12  Adventure           G   
4                         Cinderella   1950-02-15      Drama           G   
..                               ...          ...        ...         ...   
574         The Light Between Oceans   2016-09-02      Drama       PG-13   
575                   Queen of Katwe   2016-09-23      Drama          PG   
576                   Doctor Strange   2016-11-04  Adventure       PG-13   
577                            Moana   2016-11-23  Adventure          PG   
578     Rogue One: A Star Wars Story   2016-12-16  Adventure       PG-13   

     total_gross  inflation_adjusted_gross MPAA_ratings Rating  
0      184925485      

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [17]:
total_gross_cleaned.drop(columns=['Rating', 'MPAA_rating','1'],errors='ignore',inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [18]:
total_gross_cleaned

Unnamed: 0,movie_title,release_date,genre,total_gross,inflation_adjusted_gross,MPAA_ratings
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,184925485,5228953251,G
1,Pinocchio,1940-02-09,Adventure,84300000,2188229052,G
2,Fantasia,1940-11-13,Musical,83320000,2187090808,G
3,Song of the South,1946-11-12,Adventure,65000000,1078510579,G
4,Cinderella,1950-02-15,Drama,85000000,920608730,G
...,...,...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,12545979,12545979,PG
575,Queen of Katwe,2016-09-23,Drama,8874389,8874389,PG
576,Doctor Strange,2016-11-04,Adventure,232532923,232532923,PG
577,Moana,2016-11-23,Adventure,246082029,246082029,PG


As a first visualization, let's look at the average total gross of the movies released in each year. To do this, I will use the **total_gross** table. I will group by release year and then compute the average total gross for each year.

In [19]:
# Compute the mean total gross for each release year
gross_year_group = total_gross_cleaned.groupby(total_gross_cleaned['release_date'].dt.year)['total_gross'].mean()

# Reset the index so you can plot using Altair
gross_year_group = gross_year_group.reset_index()
gross_year_group = gross_year_group.sort_values(by='total_gross')
gross_year_group

Unnamed: 0,release_date,total_gross
6,1962,9230769.0
17,1993,28777850.0
21,1997,34939870.0
19,1995,35373880.0
18,1994,40332530.0
20,1996,42000380.0
14,1990,43767900.0
16,1992,44431880.0
15,1991,45013350.0
26,2002,55203290.0


Now that we have it in the proper format, we can generate a bar plot to visualize it.

In [20]:
# Use altair to generate a bar plot
total_gross_year_plot = (
    alt.Chart(gross_year_group, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X("release_date:O", title="Release Year"),
        y=alt.Y("total_gross:Q", title="Average of Total Gross"),
    )
    .properties(title="Average of Total Gross of Disney movies released by Year")
)
total_gross_year_plot

From the above plot, there appears to be an increasing trend in the average total gross over the years. There are, however, some years with an abnormally high average. For example, `1967` had a much higher average total gross than its neighboring years. Another interesting phenomenon happens between `1988` and `2002` and between `2004` and `2012`, where the average total gross decreased drastically.

Let's get some other information about the **revenue** table.

In [21]:
revenue.head()

Unnamed: 0,Year,Studio Entertainment[NI 1],Disney Consumer Products[NI 2],Disney Interactive[NI 3][Rev 1],Walt Disney Parks and Resorts,Disney Media Networks,Total
0,1991,2593.0,724.0,,2794.0,,6111
1,1992,3115.0,1081.0,,3306.0,,7502
2,1993,3673.4,1415.1,,3440.7,,8529
3,1994,4793.0,1798.2,,3463.6,359.0,10414
4,1995,6001.5,2150.0,,3959.8,414.0,12525


In [22]:
revenue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Year                             26 non-null     int64  
 1   Studio Entertainment[NI 1]       25 non-null     float64
 2   Disney Consumer Products[NI 2]   24 non-null     float64
 3   Disney Interactive[NI 3][Rev 1]  12 non-null     float64
 4   Walt Disney Parks and Resorts    26 non-null     float64
 5   Disney Media Networks            23 non-null     object 
 6   Total                            26 non-null     int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 1.5+ KB


The **revenue** table has $26$ rows and $7$ columns. Each row is a **year** with the revenue from **Studio Entertainment**, **Disney Consumer Products**, **Disney Interactive**, **Walt Disney Parks and Resorts** and **Disney Media Networks**, plus the **Total**. However, not every segment has a value for every year, and **Disney Media Networks** is stored as `object` rather than as a number, so some cleaning is needed first.

In [23]:
revenue.rename(columns={'Studio Entertainment[NI 1]':'Studio_Entertainment'}, inplace = True)

In [24]:
revenue.rename(columns={'Disney Consumer Products[NI 2]':'Disney_Products'}, inplace = True)

In [25]:
revenue.rename(columns={'Disney Interactive[NI 3][Rev 1]':'Disney_Interactive'}, inplace = True)

In [26]:
revenue.rename(columns={'Walt Disney Parks and Resorts':'Disney_Parks_Resorts'}, inplace = True)

In [27]:
revenue.rename(columns={'Disney Media Networks':'Disney_Media_Networks'}, inplace = True)
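
The five renames above could also be written as a single call with one mapping. This is a sketch on a one-row stand-in frame (the name `demo` is an assumption), using the original column names from the preview:

```python
import pandas as pd

original_columns = [
    "Year",
    "Studio Entertainment[NI 1]",
    "Disney Consumer Products[NI 2]",
    "Disney Interactive[NI 3][Rev 1]",
    "Walt Disney Parks and Resorts",
    "Disney Media Networks",
    "Total",
]
demo = pd.DataFrame(
    [[1991, 2593.0, 724.0, None, 2794.0, None, 6111]], columns=original_columns
)

# One dictionary maps every old name to its new name
demo = demo.rename(
    columns={
        "Studio Entertainment[NI 1]": "Studio_Entertainment",
        "Disney Consumer Products[NI 2]": "Disney_Products",
        "Disney Interactive[NI 3][Rev 1]": "Disney_Interactive",
        "Walt Disney Parks and Resorts": "Disney_Parks_Resorts",
        "Disney Media Networks": "Disney_Media_Networks",
    }
)
print(list(demo.columns))
```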

Converting **Disney Media Networks** to a numeric type and filling its NaN values with the column mean:

In [28]:
revenue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   26 non-null     int64  
 1   Studio_Entertainment   25 non-null     float64
 2   Disney_Products        24 non-null     float64
 3   Disney_Interactive     12 non-null     float64
 4   Disney_Parks_Resorts   26 non-null     float64
 5   Disney_Media_Networks  23 non-null     object 
 6   Total                  26 non-null     int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 1.5+ KB


In [29]:
revenue['Disney_Media_Networks'] = pd.to_numeric(revenue['Disney_Media_Networks'].str.replace(',', '', regex=True))

In [30]:
revenue['Disney_Media_Networks'].astype(float)

0         NaN
1         NaN
2         NaN
3       359.0
4       414.0
5      4142.0
6      6522.0
7      7142.0
8      7512.0
9      9615.0
10     9569.0
11     9733.0
12    10941.0
13    11778.0
14    13207.0
15    14368.0
16    15046.0
17    15857.0
18    16209.0
19    17162.0
20    18714.0
21    19436.0
22    20356.0
23    21152.0
24    23264.0
25    23689.0
Name: Disney_Media_Networks, dtype: float64

In [31]:
mean_value = revenue['Disney_Media_Networks'].mean()
revenue['Disney_Media_Networks'] = revenue['Disney_Media_Networks'].fillna(mean_value)

In [32]:
revenue['Disney_Media_Networks']

0     12877.695652
1     12877.695652
2     12877.695652
3       359.000000
4       414.000000
5      4142.000000
6      6522.000000
7      7142.000000
8      7512.000000
9      9615.000000
10     9569.000000
11     9733.000000
12    10941.000000
13    11778.000000
14    13207.000000
15    14368.000000
16    15046.000000
17    15857.000000
18    16209.000000
19    17162.000000
20    18714.000000
21    19436.000000
22    20356.000000
23    21152.000000
24    23264.000000
25    23689.000000
Name: Disney_Media_Networks, dtype: float64

In [33]:
revenue.dtypes

Year                       int64
Studio_Entertainment     float64
Disney_Products          float64
Disney_Interactive       float64
Disney_Parks_Resorts     float64
Disney_Media_Networks    float64
Total                      int64
dtype: object

In [34]:
revenue['Studio_Entertainment']

0     2593.0
1     3115.0
2     3673.4
3     4793.0
4     6001.5
5        NaN
6     6981.0
7     6849.0
8     6548.0
9     5994.0
10    7004.0
11    6465.0
12    7364.0
13    8713.0
14    7587.0
15    7529.0
16    7491.0
17    7348.0
18    6136.0
19    6701.0
20    6351.0
21    5825.0
22    5979.0
23    7278.0
24    7366.0
25    9441.0
Name: Studio_Entertainment, dtype: float64

In [35]:
revenue['Studio_Entertainment'] = revenue['Studio_Entertainment'].fillna(revenue['Studio_Entertainment'].mean())

In [36]:
revenue['Studio_Entertainment']

0     2593.000
1     3115.000
2     3673.400
3     4793.000
4     6001.500
5     6445.036
6     6981.000
7     6849.000
8     6548.000
9     5994.000
10    7004.000
11    6465.000
12    7364.000
13    8713.000
14    7587.000
15    7529.000
16    7491.000
17    7348.000
18    6136.000
19    6701.000
20    6351.000
21    5825.000
22    5979.000
23    7278.000
24    7366.000
25    9441.000
Name: Studio_Entertainment, dtype: float64

In [37]:
revenue['Disney_Products'] = revenue['Disney_Products'].fillna(revenue['Disney_Products'].mean())

As the earlier `revenue.info()` output shows, **Disney Interactive** has only $12$ non-null values out of $26$, so filling it would not be a good idea. Instead, we can drop the whole column:
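
The decision between filling and dropping can be made more explicit by computing the share of missing values per column. The sketch below uses a small frame with toy values, and the 50% cut-off is an assumption chosen only for illustration, not a rule used in this notebook:

```python
import pandas as pd

# Toy values only; the point is the pattern of missing entries
demo = pd.DataFrame(
    {
        "Studio_Entertainment": [1.0, 2.0, None, 4.0],
        "Disney_Interactive": [None, None, None, 4.0],
    }
)

# isna() gives True/False per cell; the mean of booleans is the missing share
missing_share = demo.isna().mean()
print(missing_share)

# Columns above the (assumed) threshold are candidates for dropping
to_drop = missing_share[missing_share > 0.5].index.tolist()
demo = demo.drop(columns=to_drop)
print(list(demo.columns))
```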

In [38]:
revenue.drop(columns=['Disney_Interactive'],errors='ignore',inplace=True)
revenue

Unnamed: 0,Year,Studio_Entertainment,Disney_Products,Disney_Parks_Resorts,Disney_Media_Networks,Total
0,1991,2593.0,724.0,2794.0,12877.695652,6111
1,1992,3115.0,1081.0,3306.0,12877.695652,7502
2,1993,3673.4,1415.1,3440.7,12877.695652,8529
3,1994,4793.0,1798.2,3463.6,359.0,10414
4,1995,6001.5,2150.0,3959.8,414.0,12525
5,1996,6445.036,2591.054167,4502.0,4142.0,18739
6,1997,6981.0,3782.0,5014.0,6522.0,22473
7,1998,6849.0,3193.0,5532.0,7142.0,22976
8,1999,6548.0,3030.0,6106.0,7512.0,23402
9,2000,5994.0,2602.0,6803.0,9615.0,25402


In [39]:
revenue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   26 non-null     int64  
 1   Studio_Entertainment   26 non-null     float64
 2   Disney_Products        26 non-null     float64
 3   Disney_Parks_Resorts   26 non-null     float64
 4   Disney_Media_Networks  26 non-null     float64
 5   Total                  26 non-null     int64  
dtypes: float64(4), int64(2)
memory usage: 1.3 KB


In [40]:
# Now, you can calculate the mean
total_year_group = revenue.groupby(revenue['Year'])['Total'].mean()

# Reset the index so you can plot using Altair
total_year_group = total_year_group.reset_index()
total_year_group = total_year_group.sort_values(by='Total')
total_year_group

Unnamed: 0,Year,Total
0,1991,6111
1,1992,7502
2,1993,8529
3,1994,10414
4,1995,12525
5,1996,18739
6,1997,22473
7,1998,22976
8,1999,23402
11,2002,25360


In [41]:
# Use altair to generate a bar plot
total_rev_year_plot = (
    alt.Chart(total_year_group, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X("Year:O", title="Year"),
        y=alt.Y("Total:Q", title="Average of Total Revenue"),
    )
    .properties(title="Average Total Revenue Disney Makes from Different Sources per Year")
)
total_rev_year_plot
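
Besides reading it off the chart, the year with the highest total can be looked up directly with `idxmax`. This sketch uses only the first five rows of the **revenue** preview, so its result illustrates the technique and is not the answer for the full 1991 to 2016 range:

```python
import pandas as pd

totals = pd.DataFrame(
    {
        "Year": [1991, 1992, 1993, 1994, 1995],
        "Total": [6111, 7502, 8529, 10414, 12525],
    }
)

# idxmax() returns the index label of the largest value; .loc fetches that row
best = totals.loc[totals["Total"].idxmax()]
print(int(best["Year"]), int(best["Total"]))
```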

In [42]:
# Calculate mean revenue for each segment
Studio_Entertainment = revenue['Studio_Entertainment'].mean()
Disney_Products = revenue['Disney_Products'].mean()
Disney_Parks_Resorts = revenue['Disney_Parks_Resorts'].mean()
Disney_Media_Networks = revenue['Disney_Media_Networks'].mean()

# Create a DataFrame
data = pd.DataFrame({
    'Segment': ['Studio Entertainment', 'Disney Products', 'Disney Parks & Resorts', 'Disney Media Networks'],
    'Average Total Revenue': [Studio_Entertainment, Disney_Products, Disney_Parks_Resorts, Disney_Media_Networks]
})

In [43]:
rev_year_plot = (
    alt.Chart(data, width=500, height=300)
    .mark_bar()
    .encode(
        x=alt.X("Segment:N", title="Segment"),
        y=alt.Y("Average Total Revenue:Q", title="Average Total Revenue"),
    )
    .properties(title="Average Total Revenue of Disney from Different Sources")
)

# Show the bar chart
rev_year_plot
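
The bar chart compares segment averages over the whole period, while the question asks which segment leads in most years. One way to check this per year is a row-wise `idxmax` followed by a count. The sketch below uses the 1997 to 2000 rows from the cleaned table shown earlier; the names `segments` and `top_segment` are assumptions introduced here:

```python
import pandas as pd

segments = pd.DataFrame(
    {
        "Year": [1997, 1998, 1999, 2000],
        "Studio_Entertainment": [6981.0, 6849.0, 6548.0, 5994.0],
        "Disney_Products": [3782.0, 3193.0, 3030.0, 2602.0],
        "Disney_Parks_Resorts": [5014.0, 5532.0, 6106.0, 6803.0],
        "Disney_Media_Networks": [6522.0, 7142.0, 7512.0, 9615.0],
    }
).set_index("Year")

# For each year (row), find the column with the largest revenue
top_segment = segments.idxmax(axis=1)
print(top_segment)

# Count how many years each segment comes out on top
counts = top_segment.value_counts()
print(counts)
```

Running the same two steps on the full cleaned table would give a year-by-year answer; note that years whose values were filled with the column mean would then be counted as well.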

Now let's add a new yearly entry to the **revenue** table. To do this, I will import and use the script I created, which contains a custom function, `add_new_entry`, that takes a DataFrame together with a year and the revenue values for that year, and adds them to the DataFrame as a new entry.

In [44]:
from add_new_entry import add_new_entry

In [53]:
!black add_new_entry.py

reformatted add_new_entry.py
All done! ✨ 🍰 ✨
1 file reformatted.


In [55]:
!pytest test_entry.py

platform linux -- Python 3.8.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/prog-python-ds-students/release/final_project
plugins: anyio-3.2.1, dash-1.20.0
collected 1 item

test_entry.py .                                                          [100%]



In [49]:
from test_entry import add_new_entry

In [51]:
add_new_entry(revenue,2023,54864.0,11555.0,1158.0,161496.0)

The data entry for the year 2023 has been added to the DataFrame.


Unnamed: 0,Year,Studio_Entertainment,Disney_Products,Disney_Parks_Resorts,Disney_Media_Networks,Total
0,1991.0,2593.0,724.0,2794.0,12877.695652,6111.0
1,1992.0,3115.0,1081.0,3306.0,12877.695652,7502.0
2,1993.0,3673.4,1415.1,3440.7,12877.695652,8529.0
3,1994.0,4793.0,1798.2,3463.6,359.0,10414.0
4,1995.0,6001.5,2150.0,3959.8,414.0,12525.0
5,1996.0,6445.036,2591.054167,4502.0,4142.0,18739.0
6,1997.0,6981.0,3782.0,5014.0,6522.0,22473.0
7,1998.0,6849.0,3193.0,5532.0,7142.0,22976.0
8,1999.0,6548.0,3030.0,6106.0,7512.0,23402.0
9,2000.0,5994.0,2602.0,6803.0,9615.0,25402.0
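
The source of `add_new_entry.py` is not shown in this notebook. Purely as an illustration of what a helper of this kind might look like, here is a hypothetical sketch. The function name `add_revenue_row`, its parameters, the duplicate-year check, and the choice to compute `Total` as the sum of the four segments are all assumptions and may differ from the real script:

```python
import pandas as pd


def add_revenue_row(df, year, studio, products, parks, media):
    """Hypothetical helper: return a copy of df with one extra year appended."""
    # Refuse to add a year that already exists in the table
    if year in df["Year"].values:
        raise ValueError(f"Year {year} is already present.")
    new_row = pd.DataFrame(
        [
            {
                "Year": year,
                "Studio_Entertainment": studio,
                "Disney_Products": products,
                "Disney_Parks_Resorts": parks,
                "Disney_Media_Networks": media,
                # Assumption: the total is the sum of the four segments
                "Total": studio + products + parks + media,
            }
        ]
    )
    # concat returns a new frame; the input frame is left unchanged
    return pd.concat([df, new_row], ignore_index=True)


# Toy values, not real revenue figures
demo = pd.DataFrame(
    [
        {
            "Year": 1991,
            "Studio_Entertainment": 1.0,
            "Disney_Products": 1.0,
            "Disney_Parks_Resorts": 1.0,
            "Disney_Media_Networks": 1.0,
            "Total": 4,
        }
    ]
)
out = add_revenue_row(demo, 2023, 1.0, 2.0, 3.0, 4.0)
print(out)
```

Returning a new frame instead of modifying the input keeps the original table intact, which makes the helper easier to test.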


# Discussions

Report on Data Analysis Summary:
The analysis in this notebook looks at the total gross and overall revenue of Disney over a number of years. To obtain insights about Disney's performance, the analysis included importing, cleaning, and visualizing the data. With respect to my question, the findings suggest that the revenue of Disney Media Networks is the highest among all 4 segments during the majority of the years.

**Cleaning and Importing Data:**

1. To make the analysis easier, libraries like Pandas, Altair, NumPy, and a special module called "add_new_entry" were imported.
2. For additional analysis, data from three tables—"total_gross," "characters," and "revenue"—were imported.
3. Release dates, genres, and gross receipts for Disney films are all shown in the 'total_gross' table.
4. Disney movie character information may be found in the 'characters' table.
5. Revenue information for different Disney segments, including Studio Entertainment, Consumer Products, Parks and Resorts, is included in the "revenue" table.

**Visualizations of Data:**

1. The average total gross income of the Disney films released annually was first determined in the analysis. The average total gross was calculated by grouping the data according to the release year in the 'total_gross' table.

2. To show the average total gross of Disney films by year of release, a bar plot was created. With some variations across the years, the plot demonstrated a growing tendency.

3. Disney's revenue across several segments was examined using the 'revenue' table. For clarity, certain columns were given new names.

4. The column mean filled the null entries in the 'Disney_Media_Networks', 'Studio_Entertainment' and 'Disney_Products' columns.

5. The 'Disney_Interactive' field was eliminated from the analysis due to an excessive number of null values.

6. Disney's success over the years in different segments was examined using the cleansed data.

To sum up, the data analysis offers insightful information on Disney's film success in terms of both overall gross revenue and revenue from various business sectors. Disney's financial success throughout time may be better understood by using the visualizations to spot trends and irregularities. More thorough understandings and conclusions may result from more investigation and study of this data.

# References

Not all the work in this notebook is original. Some parts were borrowed from online resources. I take no credit for parts that are not mine.

## Resources used
* [Data Source](https://data.world/kgarrett/disney-character-success-00-16)
    * The Disney database used in this work was curated by **Kelly Garret**.
    * The Numbers, “Movies Released by Walt Disney”
        * A chart that provides a list of Disney movies and their genre, gross, and MPAA ratings.
    * Wikipedia, “Annual gross revenues of The Walt Disney Company”
        * A Disney financial data chart that contains annual gross revenues by section (includes studio entertainment, parks and resorts, etc.) from 1991-2016. The data are collected from the Disney annual report.
* [Data Visualization](https://www.kaggle.com/asindico/data-exploration)
    * Inspiration for plotting the yearly averages was taken from a **sample provided by UBC**.