### **A LOOK INTO DISNEY+ SHOWS AND MOVIES [PYTHON EXPLORATORY DATA ANALYSIS]**

In this exploratory data analysis (EDA), we look at a data set of all shows and movies available on Disney+ in the United States, acquired in May 2022 (via Kaggle), to see whether there are any interesting correlations that may lead us to understand what criteria make a show or movie well rated. By understanding these factors, perhaps we can gain better insight into how to build a better recommendation system and improve customer retention/engagement.

### **Setting Up Environment**
##### Importing Necessary Libraries

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

##### Loading in Data

In [3]:
credits = pd.read_csv(r'C:\Users\Kevin\Documents\Disney_Data\credits.csv', index_col=False) 
titles = pd.read_csv(r'C:\Users\Kevin\Documents\Disney_Data\titles.csv', index_col=False)

##### Data Cleaning

Now let's take a brief look at the datasets.

In [4]:
titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    1535 non-null   object 
 1   title                 1535 non-null   object 
 2   type                  1535 non-null   object 
 3   description           1529 non-null   object 
 4   release_year          1535 non-null   int64  
 5   age_certification     1210 non-null   object 
 6   runtime               1535 non-null   int64  
 7   genres                1535 non-null   object 
 8   production_countries  1535 non-null   object 
 9   seasons               415 non-null    float64
 10  imdb_id               1133 non-null   object 
 11  imdb_score            1108 non-null   float64
 12  imdb_votes            1105 non-null   float64
 13  tmdb_popularity       1524 non-null   float64
 14  tmdb_score            1426 non-null   float64
dtypes: float64(5), int64(2), object(8)

In [12]:
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,tm74391,Fantasia,MOVIE,Walt Disney's timeless masterpiece is an extra...,1940,G,120,"['animation', 'family', 'music', 'fantasy']",['US'],,tt0032455,7.7,94681.0,57.751,7.4
1,tm67803,Snow White and the Seven Dwarfs,MOVIE,"A beautiful girl, Snow White, takes refuge in ...",1937,G,83,"['fantasy', 'family', 'romance', 'animation', ...",['US'],,tt0029583,7.6,195321.0,107.137,7.1
2,tm82546,Pinocchio,MOVIE,Lonely toymaker Geppetto has his wishes answer...,1940,G,88,"['animation', 'comedy', 'family', 'fantasy']",['US'],,tt0032910,7.5,141937.0,71.16,7.1
3,tm79357,Bambi,MOVIE,Bambi's tale unfolds from season to season as ...,1942,G,70,"['animation', 'drama', 'family']",['US'],,tt0034492,7.3,140406.0,68.136,7.0
4,tm62671,Treasure Island,MOVIE,Enchanted by the idea of locating treasure bur...,1950,PG,96,"['family', 'action']","['GB', 'US']",,tt0043067,6.9,8229.0,10.698,6.5


In [6]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26412 entries, 0 to 26411
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  26412 non-null  int64 
 1   id         26412 non-null  object
 2   name       26412 non-null  object
 3   character  24769 non-null  object
 4   role       26412 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.0+ MB


In [7]:
credits.head()

Unnamed: 0,person_id,id,name,character,role
0,23433,tm74391,Deems Taylor,Narrator - Narrative Introductions,ACTOR
1,5910,tm74391,Walt Disney,Mickey Mouse (segment 'The Sorcerer's Apprenti...,ACTOR
2,23436,tm74391,Julietta Novis,Soloist (segment 'Ave Maria') (singing voice),ACTOR
3,23434,tm74391,Leopold Stokowski,Himself - Conductor of The Philadelphia Orchestra,ACTOR
4,23441,tm74391,Paul Satterfield,,DIRECTOR


##### Check for Duplicates

In [13]:
duplicateTRows = titles[titles.duplicated(keep='last')]
duplicateTRows.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score


In [14]:
duplicateCRows2 = credits[credits.duplicated(keep='last')]
duplicateCRows2.head()

Unnamed: 0,person_id,id,name,character,role
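
The checks above compare entire rows. As a minimal sketch, the snippet below also counts repeated keys and missing values on a small made-up frame. The `sample` data and the choice of `id` as the key column are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for the titles table (values are made up)
sample = pd.DataFrame({
    "id": ["tm1", "tm2", "tm2", "ts3"],
    "title": ["A", "B", "B", "C"],
    "imdb_score": [7.7, 6.1, 6.3, None],
})

# Rows that are identical in every column
full_row_dupes = int(sample.duplicated().sum())

# Rows that repeat an id even though other columns differ
id_dupes = int(sample.duplicated(subset="id").sum())

# Number of missing values in each column
missing_per_column = sample.isna().sum()

print(full_row_dupes, id_dupes)
print(missing_per_column)
```

On the real data, the same three lines could be run against `titles` and `credits` directly. A key-based check may be worth adding because a full-row comparison can miss records that share an `id` but differ in one field.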


We can see that the 'titles' dataset is comprised of 15 columns of various identifiers and attributes, while the 'credits' dataset is comprised of 5 columns that describe a person and their association with their respective titles. There also seem to be no duplicates in either dataset, so nothing needs to be transformed in that respect. However, the columns of interest we want to analyze, such as the imdb/tmdb scores, have differing numbers of non-null entries. In addition, each value in the genres column of the titles dataset holds multiple genres, so analyzing genres will not be possible until I split the column and pivot it to normalize it to 1NF (side note: doing so would create duplicate title rows, so this step is preferably done last so that it does not complicate or interfere with the other parts of the analysis).
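
One possible way to split the genres into one row per title and genre is sketched below. It assumes each `genres` value is stored as the string form of a Python list, as the `head()` output suggests. The `sample` frame and the `genres_long` name are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical two-row sample mirroring the layout of the titles table
sample = pd.DataFrame({
    "id": ["tm74391", "tm62671"],
    "title": ["Fantasia", "Treasure Island"],
    "genres": [
        "['animation', 'family', 'music', 'fantasy']",
        "['family', 'action']",
    ],
})

# Parse each string into a real list, then emit one row per (title, genre)
genres_long = (
    sample.assign(genres=sample["genres"].apply(ast.literal_eval))
    .explode("genres")
    .rename(columns={"genres": "genre"})
    .reset_index(drop=True)
)

print(genres_long)
```

Writing the result to a separate frame leaves the original table at one row per title, so the duplicated title rows only appear where the genre analysis needs them.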

### Analysis
I am curious to see the overall distribution of movies and shows across release years to see how extensive Disney+'s library is.

In [10]:
fig = px.histogram(titles, x="release_year", color="type",
                   marginal= "rug", title="Release Year History of Disney+ Catalog (US)", 
                   labels= { "type" : "Type", "release_year" : "Release Year"},
                   color_discrete_map= {"MOVIE": "#BFF5FD", "SHOW" : "#C39BD3"},
                   template="plotly_dark")

fig.update_layout(paper_bgcolor = '#1A1D29', 
                  plot_bgcolor = '#1A1D29',
                  font = dict(family="Verdana",color = '#FFFFFF', size=13),
                  barmode='overlay'
                 )

fig.show()

It appears the oldest movie and the oldest show date back to 1928 and 1955 respectively. The distribution is left-skewed, with most titles concentrated in recent years and a long tail reaching back to the earliest releases. That is to be expected, as more content has probably been produced in recent years and digital formats are a very modern development. Now let's look at how many movies and shows there are respectively.
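
Before moving on, the readings taken from the histogram can also be checked numerically. The sketch below uses a made-up `sample` frame with the same `type` and `release_year` columns. The values are illustrative only and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical sample with the same columns as the titles table
sample = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "release_year": [1940, 2019, 2021, 1955, 2020],
})

# Earliest release year for each type
earliest_by_type = sample.groupby("type")["release_year"].min()

# Sample skewness: a negative value means a longer tail toward older years
skewness = sample["release_year"].skew()

print(earliest_by_type)
print(skewness)
```

Running the same two expressions on `titles` would give the earliest year per type and the sign of the skew without reading them off the chart.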

In [11]:
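# Count titles per type so each bar shows the number of movies or shows
type_counts = titles["type"].value_counts().reset_index()
type_counts.columns = ["type", "count"]

fig = px.bar(type_counts, x="type", y="count", color="type",
             title="Number of Movies and Shows in Disney+ Library (US)",
             labels={"type": "Type", "count": "Number of Titles"},
             color_discrete_map={"MOVIE": "red", "SHOW": "blue"},
             template="plotly_dark")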

fig.show()