- Recently I got a lot of feedback from friends who have just changed, or are about to change, their careers toward Data Analysis, Data Science and Machine Learning. They pointed to a lack of material between the start of the analysis journey and the advanced techniques.

- They are looking for Exploratory Data Analysis examples that are detailed but beginner friendly and not too complicated (no heavy regression or normalization techniques, etc.), and that show them how to start and, most importantly, how to read the descriptive statistics and graphs.

- After getting this feedback, I decided to make a series of EDAs on different datasets, without making things too complicated for people at the first steps of their DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.



#### **INTRO**



- In this study, we are going to carry out an Exploratory Data Analysis (EDA) of the Hollywood's Most Profitable Movies dataset.
- The study aims to be beginner friendly and to explain each step along the way as much as possible.
- The dataset has 74 movies along with their ratings, profitability, worldwide gross and lead studio.
- The data covers movies from 2007 to 2011.

- First, let's import the required libraries.
- We will use Plotly for interactive visualization.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [2]:
df= pd.read_csv('../input/hollywood-most-profitable-stories/HollywoodsMostProfitableStories.csv')

In [3]:
df.head()

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
0,27 Dresses,Comedy,Fox,71.0,5.343622,40.0,160.308654,2008
1,(500) Days of Summer,Comedy,Fox,81.0,8.096,87.0,60.72,2009
2,A Dangerous Method,Drama,Independent,89.0,0.448645,79.0,8.972895,2011
3,A Serious Man,Drama,Universal,64.0,4.382857,89.0,30.68,2009
4,Across the Universe,Romance,Independent,84.0,0.652603,54.0,29.367143,2007


In [4]:
df.shape

(74, 8)

- We have 74 films and 8 variables to work on

In [5]:
df.isnull().sum()

Film                 0
Genre                0
Lead Studio          1
Audience  score %    1
Profitability        3
Rotten Tomatoes %    1
Worldwide Gross      0
Year                 0
dtype: int64

- We have several missing values, which we need to look into.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Film               74 non-null     object 
 1   Genre              74 non-null     object 
 2   Lead Studio        73 non-null     object 
 3   Audience  score %  73 non-null     float64
 4   Profitability      71 non-null     float64
 5   Rotten Tomatoes %  73 non-null     float64
 6   Worldwide Gross    74 non-null     float64
 7   Year               74 non-null     int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 4.8+ KB


- All data types are as expected (text columns are object, numeric columns are float64 or int64), so nothing needs to be converted.

In [7]:
df.describe()

Unnamed: 0,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
count,73.0,71.0,73.0,74.0,74.0
mean,64.136986,4.74161,47.356164,136.351979,2009.054054
std,13.647665,8.292017,26.242655,157.067561,1.353756
min,35.0,0.005,3.0,0.025,2007.0
25%,52.0,1.79068,27.0,32.4475,2008.0
50%,64.0,2.642353,45.0,73.198612,2009.0
75%,76.0,4.850958,65.0,190.18525,2010.0
max,89.0,66.934,96.0,709.82,2011.0


Before going further, let's summarize what we have learned about the dataset.

- Our dataset has 74 films from different genres and lead studios.

- Object type variables, like genre and lead studio, can be used to group the films and compare the groups.

- There are several missing values to look into.

- The numerical variables deserve special attention in the analysis.


- Let's make the necessary adjustments before moving on to the analysis part.

#### Missing Values

- Let's remind ourselves which columns have missing values.

In [8]:
df.isnull().sum()

Film                 0
Genre                0
Lead Studio          1
Audience  score %    1
Profitability        3
Rotten Tomatoes %    1
Worldwide Gross      0
Year                 0
dtype: int64

- Since we will use all the films, the Film column must not have missing values, and it doesn't. So we are OK for the Film column.
- The Genre, Worldwide Gross and Year columns do not have any missing values. Hurray!!
- Let's look at the rows with missing values, column by column.

In [9]:
df[df['Lead Studio'].isnull()]

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
38,No Reservations,Comedy,,64.0,3.30718,39.0,92.60105,2007


- There are many different ways to deal with missing values, and the right choice depends on domain knowledge and expertise.

- Most of the time, people either drop the row or fill the gap with the fillna function so that the row can still be used. As noted above, the choice depends on the data, domain knowledge and the importance of the variable in the analysis.

- In this dataset, the main variable we want to work on is 'Film'. If we have the name of the film and values for at least some of the other variables, we can use that row.

- Based on the aforementioned points, we can keep this row: it has a lot of useful information.
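To make the options above concrete, here is a minimal sketch on three rows copied from this notebook's outputs. The 'Unknown' placeholder label is an illustrative assumption, not a value from the dataset.

```python
import numpy as np
import pandas as pd

# Three rows copied from the outputs in this notebook; NaN marks the missing studio.
sample = pd.DataFrame({
    'Film': ['27 Dresses', 'No Reservations', 'A Serious Man'],
    'Lead Studio': ['Fox', np.nan, 'Universal'],
    'Worldwide Gross': [160.308654, 92.60105, 30.68],
})

# Option 1: drop every row that has a missing value (we lose 'No Reservations').
dropped = sample.dropna()

# Option 2: keep the row and fill the gap with a placeholder label.
# 'Unknown' is a hypothetical choice made only for this illustration.
filled = sample.fillna({'Lead Studio': 'Unknown'})

# Option 3 (what this notebook does): leave the NaN in place.
# value_counts() ignores NaN by default, so the row simply does not count for studios.
studio_counts = sample['Lead Studio'].value_counts()

print(len(dropped), len(filled), len(sample))
print(studio_counts)
```

Keeping the row costs nothing here: the film still contributes to every column where it has a value.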


In [10]:
df[df['Audience  score %'].isnull()]

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
50,Something Borrowed,Romance,Independent,,1.719514,,60.183,2011


- Audience score and Rotten Tomatoes score are good variables to use for rating purposes.
- In this row, both of them are missing.
- We still have values for the other columns, so we can keep this row.

In [11]:
df[df['Profitability'].isnull()]

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
18,Jane Eyre,Romance,Universal,77.0,,85.0,30.147,2011
41,Our Family Wedding,Comedy,Independent,49.0,,14.0,21.37,2010
70,When in Rome,Comedy,Disney,44.0,,15.0,43.04,2010


- Profitability has 3 missing values.
- Even though we don't have profitability values in these rows, we have values for the other columns.
- So it is better to keep them.
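Instead of checking one column at a time, all rows with at least one missing value can also be listed in a single step. The sketch below uses four rows copied from the outputs above, with only a few of the columns, so it is an illustration rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Four rows copied from this notebook's outputs (subset of columns).
sample = pd.DataFrame({
    'Film': ['27 Dresses', 'No Reservations', 'Something Borrowed', 'Jane Eyre'],
    'Lead Studio': ['Fox', np.nan, 'Independent', 'Universal'],
    'Audience  score %': [71.0, 64.0, np.nan, 77.0],
    'Profitability': [5.343622, 3.30718, 1.719514, np.nan],
})

# True for every row that has a missing value in any column.
has_missing = sample.isnull().any(axis=1)
rows_with_missing = sample[has_missing]

# Number of missing values in each row.
missing_per_row = sample.isnull().sum(axis=1)

print(rows_with_missing)
print(missing_per_row)
```

On the real data the same two lines would be applied to df instead of sample.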

#### Look at the 'Genre' and 'Lead Studio'

In [12]:
df['Genre'].value_counts()

Comedy       41
Romance      15
Drama        13
Animation     3
Action        1
Fantasy       1
Name: Genre, dtype: int64

- Genre looks fine to use in a groupby, although Animation, Action and Fantasy have very few films.
- Noted.

In [13]:
df['Lead Studio'].value_counts()

Independent              19
Warner Bros.             12
Universal                 7
Disney                    7
Fox                       6
Summit                    5
Sony                      4
Paramount                 4
The Weinstein Company     3
20th Century Fox          2
Lionsgate                 2
CBS                       1
New Line                  1
Name: Lead Studio, dtype: int64

- It can be used.
- It is not in perfect shape for a groupby (many studios have only a handful of films), but it is good enough to use.
- Noted.
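One reason the column is not in perfect shape is that 'Fox' and '20th Century Fox' appear as separate labels. If we assume they refer to the same studio (this is an assumption, the dataset does not say so), the labels could be merged before grouping. A minimal sketch, using counts copied from the output above:

```python
import pandas as pd

# Rebuild three of the studio labels with the counts shown in value_counts() above.
studios = pd.Series(
    ['Fox'] * 6 + ['20th Century Fox'] * 2 + ['Summit'] * 5,
    name='Lead Studio',
)

# Assumption for illustration only: both labels mean the same studio.
studio_map = {'20th Century Fox': 'Fox'}
cleaned = studios.replace(studio_map)

cleaned_counts = cleaned.value_counts()
print(cleaned_counts)
```

This notebook keeps the original labels; the sketch only shows how such a cleanup could look.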

- Everything seems OK.  Let's move on to the next step: **analysis part**.

### Analysis Part

#### Genre

In [14]:
df['Genre'].value_counts(normalize=True)

Comedy       0.554054
Romance      0.202703
Drama        0.175676
Animation    0.040541
Action       0.013514
Fantasy      0.013514
Name: Genre, dtype: float64

- About 55% of the movies are in the Comedy genre.
- Romance (20%) and Drama (18%) follow it.
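For anyone wondering what normalize=True does: it divides each count by the total number of films. The sketch below reproduces the shares by hand from the genre counts shown earlier.

```python
import pandas as pd

# Genre counts copied from df['Genre'].value_counts() above.
counts = pd.Series({
    'Comedy': 41, 'Romance': 15, 'Drama': 13,
    'Animation': 3, 'Action': 1, 'Fantasy': 1,
})

# Share of each genre = its count divided by the total number of films.
shares = counts / counts.sum()

print(shares.round(6))
# The same shares as percentages, which are easier to read.
print((shares * 100).round(1))
```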

In [15]:
fig = px.histogram(df, x="Genre", title='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### Lead Studio

In [16]:
df['Lead Studio'].value_counts(normalize=True)

Independent              0.260274
Warner Bros.             0.164384
Universal                0.095890
Disney                   0.095890
Fox                      0.082192
Summit                   0.068493
Sony                     0.054795
Paramount                0.054795
The Weinstein Company    0.041096
20th Century Fox         0.027397
Lionsgate                0.027397
CBS                      0.013699
New Line                 0.013699
Name: Lead Studio, dtype: float64

- Independent studios account for 26% of the films in this dataset (among the films with a known lead studio).
- Warner Bros. comes second with 16%.

In [17]:
fig = px.histogram(df, x="Lead Studio", title='Lead Studios')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

### Audience Score %

In [18]:
df['Audience  score %'].describe()

count    73.000000
mean     64.136986
std      13.647665
min      35.000000
25%      52.000000
50%      64.000000
75%      76.000000
max      89.000000
Name: Audience  score %, dtype: float64

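- Based on the mean and median values, the audience score looks roughly symmetric: the mean (64.1) and the median (64) are almost equal.
- A similar mean and median suggest symmetry, but they do not prove a normal distribution, so let's check the histogram.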

In [19]:

fig = px.histogram(df, x='Audience  score %', title='Percentage of the Audience  score', marginal="box", hover_data=['Film', 'Genre'])


fig.show()

#### Profitability

In [20]:
df['Profitability'].describe()

count    71.000000
mean      4.741610
std       8.292017
min       0.005000
25%       1.790680
50%       2.642353
75%       4.850958
max      66.934000
Name: Profitability, dtype: float64

- We have a right-skewed distribution (the mean, 4.74, is significantly bigger than the median, 2.64).
- This basically means that we have possible outliers in our dataset and they pull the mean upward.
- Let's see it.
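Why does a single extreme value matter so much? Here is a small sketch using the profitability of the first five films from df.head() plus 'Fireproof' (66.934, the maximum in the data). It is only a six-film illustration, not the full dataset.

```python
import pandas as pd

# Profitability of the first five films shown in df.head().
without_outlier = pd.Series([5.343622, 8.096, 0.448645, 4.382857, 0.652603])

# The same films plus 'Fireproof', the most profitable film in the dataset.
with_outlier = pd.concat([without_outlier, pd.Series([66.934])], ignore_index=True)

mean_without, median_without = without_outlier.mean(), without_outlier.median()
mean_with, median_with = with_outlier.mean(), with_outlier.median()

# The mean jumps when the extreme value is added, the median barely moves.
print('mean  :', round(mean_without, 2), '->', round(mean_with, 2))
print('median:', round(median_without, 2), '->', round(median_with, 2))
```

This is why the median is usually the safer summary for skewed variables.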

In [21]:
fig = px.histogram(df, x='Profitability', title='Profitability of the Movies', marginal="box", hover_data=['Film', 'Genre'])


fig.show()

- Yes, we have quite a number of outliers and a right-skewed distribution of movie profitability.
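A common rule of thumb behind box plot outliers is the 1.5 x IQR rule: values above Q3 + 1.5 x IQR (or below Q1 - 1.5 x IQR) are flagged. The sketch below applies it to the quartiles from describe() and to six values from the most profitable films table in this notebook. Plotly may compute quartiles slightly differently, so the fences in the plot can differ a little from these numbers.

```python
# Quartiles copied from df['Profitability'].describe() above.
q1 = 1.790680
q3 = 4.850958

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are flagged
upper_fence = q3 + 1.5 * iqr    # values above this are flagged

# Six films and their profitability, copied from the most profitable films table.
top_profit = {
    'Fireproof': 66.934,
    'High School Musical 3: Senior Year': 22.913136,
    'The Twilight Saga: New Moon': 14.1964,
    'Waitress': 11.089742,
    'Twilight': 10.180027,
    'Mamma Mia!': 9.234454,
}
flagged = [film for film, value in top_profit.items() if value > upper_fence]

print(round(lower_fence, 3), round(upper_fence, 3))
print(flagged)
```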

#### Rotten Tomatoes %

In [22]:
df['Rotten Tomatoes %'].describe()

count    73.000000
mean     47.356164
std      26.242655
min       3.000000
25%      27.000000
50%      45.000000
75%      65.000000
max      96.000000
Name: Rotten Tomatoes %, dtype: float64

- The maximum score in the data is 96, and high values like this pull the mean up a little.
- We can expect a right-skewed distribution, but only to a small extent,
- because the mean and median scores are close to each other (47 and 45).

In [23]:
fig = px.histogram(df, x='Rotten Tomatoes %', title='Rating Score-Rotten Tomatoes %', marginal="box", hover_data=['Film', 'Genre'])


fig.show()

#### Worldwide Gross

In [24]:
df['Worldwide Gross'].describe()

count     74.000000
mean     136.351979
std      157.067561
min        0.025000
25%       32.447500
50%       73.198612
75%      190.185250
max      709.820000
Name: Worldwide Gross, dtype: float64

- As you can see, we have a highly right-skewed distribution (mean = 136.35, median = 73.20).
- We have possible outliers.
- Let's see it.

In [25]:
fig = px.histogram(df, x='Worldwide Gross', title='Worldwide Gross', marginal="box", hover_data=['Film', 'Genre'])
fig.show()

- Before moving on to the details, let's look at the correlation matrix for our dataset.

In [26]:
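# Keep only the numeric columns (recent pandas versions raise an error on text columns such as Film) and leave out Year
df.select_dtypes('number').drop('Year', axis=1).corr()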

Unnamed: 0,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross
Audience score %,1.0,0.042083,0.60199,0.395357
Profitability,0.042083,1.0,0.02421,0.146705
Rotten Tomatoes %,0.60199,0.02421,1.0,0.019748
Worldwide Gross,0.395357,0.146705,0.019748,1.0


In [27]:
index_vals = df['Genre'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='Audience  score %',
                                 values=df['Audience  score %']),
                            dict(label='Profitability',
                                 values=df['Profitability']),
                            dict(label='Rotten Tomatoes %',
                                 values=df['Rotten Tomatoes %']),
                            dict(label='Worldwide Gross',
                                 values=df['Worldwide Gross'])],
                showupperhalf=False, 
                text=df['Film'],
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Movies',
    width=1000,
    height=1000,
)

fig.show()

Based on the results:
- There is a positive, moderate relationship (.60) between Audience Score and Rotten Tomatoes.
- There is also a positive but weak relationship (.395) between Audience Score and Worldwide Gross.

- After getting the overall picture of the data, we can go into more detail.
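To see where a correlation value comes from, here is a sketch that computes the Pearson correlation by hand and compares it with pandas. It uses the two rating columns of the first five films from df.head() plus 'Something Borrowed', whose ratings are both missing. With only five complete rows, the result will not match the .60 from the full dataset.

```python
import numpy as np
import pandas as pd

# Ratings of the first five films in df.head(), plus one row with missing ratings.
sample = pd.DataFrame({
    'Audience  score %': [71.0, 81.0, 89.0, 64.0, 84.0, np.nan],
    'Rotten Tomatoes %': [40.0, 87.0, 79.0, 89.0, 54.0, np.nan],
})

# pandas drops rows with missing values before computing the correlation.
r_pandas = sample['Audience  score %'].corr(sample['Rotten Tomatoes %'])

# The same calculation by hand on the complete rows:
# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), where dx and dy are deviations from the mean.
complete = sample.dropna()
dx = complete['Audience  score %'] - complete['Audience  score %'].mean()
dy = complete['Rotten Tomatoes %'] - complete['Rotten Tomatoes %'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_pandas, 4), round(r_manual, 4))
```

Both numbers agree, which also shows that missing values do not break corr(): incomplete rows are simply left out.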

In [28]:
# Count films per genre within each year; Genre stays as the index and Year becomes a column
genre_by_year = df.groupby('Year')['Genre'].value_counts().rename('Genre count').reset_index(level=0)
genre_by_year

Genre,Year,Genre count
Comedy,2007,6
Romance,2007,5
Comedy,2008,12
Drama,2008,3
Romance,2008,2
Animation,2008,1
Fantasy,2008,1
Comedy,2009,7
Drama,2009,5
Comedy,2010,15


In [29]:
fig = px.line(genre_by_year, x='Year', y='Genre count', color= genre_by_year.index, title='Movies By Genre in Each Year')
fig.show()

- From the line plot we can see that the number of Comedy and Drama movies is not consistent from year to year.
- The number of Romance movies increased significantly after 2008.
- The number of Animation movies is stable from year to year.
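The same counts can also be laid out as a Year by Genre table, which is sometimes easier to read than a line plot. A minimal sketch with pd.crosstab on eight films copied from the tables in this notebook (so the numbers are for this small sample only):

```python
import pandas as pd

# Year and genre of eight films that appear in the outputs of this notebook.
sample = pd.DataFrame({
    'Film': ['27 Dresses', '(500) Days of Summer', 'A Dangerous Method', 'A Serious Man',
             'Across the Universe', 'Mamma Mia!', 'WALL-E', 'Twilight'],
    'Genre': ['Comedy', 'Comedy', 'Drama', 'Drama',
              'Romance', 'Comedy', 'Animation', 'Romance'],
    'Year': [2008, 2009, 2011, 2009, 2007, 2008, 2008, 2008],
})

# Rows are years, columns are genres, cells are film counts.
# Combinations that never occur are filled with 0 instead of being left out.
wide = pd.crosstab(sample['Year'], sample['Genre'])
print(wide)
```

Unlike value_counts(), the table shows explicit zeros, so it is clear in which years a genre has no films at all.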

### Profitability by Year

In [30]:
fig = px.scatter(df, x='Year', y='Profitability', title='Movies By Profitability in Each Year')
fig.show()

### Worldwide Gross by Year

In [31]:
fig = px.scatter(df, x='Year', y='Worldwide Gross', title='Movies By Worldwide Gross in Each Year')
fig.show()

- Let's look at the top 15 movies by Worldwide Gross.

#### Top 15 Worldwide Gross Movies

In [32]:
top_15 = df.sort_values('Worldwide Gross', ascending=False)[:15]
top_15

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
59,The Twilight Saga: New Moon,Drama,Summit,78.0,14.1964,27.0,709.82,2009
62,Twilight: Breaking Dawn,Romance,Independent,68.0,6.383364,26.0,702.17,2011
29,Mamma Mia!,Comedy,Universal,76.0,9.234454,53.0,609.473955,2008
67,WALL-E,Animation,Disney,89.0,2.896019,96.0,521.283432,2008
47,Sex and the City,Comedy,Warner Bros.,81.0,7.221796,49.0,415.253258,2008
61,Twilight,Romance,Summit,82.0,10.180027,49.0,376.661,2008
51,Tangled,Animation,Disney,88.0,1.365692,89.0,355.08,2010
7,Enchanted,Comedy,Disney,80.0,4.005737,93.0,340.487652,2007
57,The Proposal,Comedy,Disney,74.0,7.8675,43.0,314.7,2009
48,Sex and the City 2,Comedy,Warner Bros.,49.0,2.8835,15.0,288.35,2010


In [33]:
fig = px.bar(top_15, x='Film', y='Worldwide Gross', hover_data=['Year', 'Genre'], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Let's see the top 15 movies by Rotten Tomatoes % rating.

#### Top 15 Rotten Tomatoes % Rating Movies

In [34]:
rotten_tomatoes_top_15 = df.sort_values('Rotten Tomatoes %', ascending=False)[:15]
rotten_tomatoes_top_15

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
67,WALL-E,Animation,Disney,89.0,2.896019,96.0,521.283432,2008
31,Midnight in Paris,Romance,Sony,84.0,8.744706,93.0,148.66,2011
7,Enchanted,Comedy,Disney,80.0,4.005737,93.0,340.487652,2007
21,Knocked Up,Comedy,Universal,83.0,6.636402,91.0,219.001261,2007
51,Tangled,Animation,Disney,88.0,1.365692,89.0,355.08,2010
3,A Serious Man,Drama,Universal,64.0,4.382857,89.0,30.68,2009
66,Waitress,Romance,Independent,67.0,11.089742,89.0,22.179483,2007
1,(500) Days of Summer,Comedy,Fox,81.0,8.096,87.0,60.72,2009
45,Rachel Getting Married,Drama,Independent,61.0,1.384167,85.0,16.61,2008
18,Jane Eyre,Romance,Universal,77.0,,85.0,30.147,2011


In [35]:
fig = px.bar(rotten_tomatoes_top_15, x='Film', y='Rotten Tomatoes %', hover_data=['Year', 'Genre'], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Let's see the top 15 movies by Audience score % rating.

#### Top 15 Audience score % Rating Movies

In [36]:
audience_score_top_15 = df.sort_values('Audience  score %', ascending=False)[:15]
audience_score_top_15

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
2,A Dangerous Method,Drama,Independent,89.0,0.448645,79.0,8.972895,2011
67,WALL-E,Animation,Disney,89.0,2.896019,96.0,521.283432,2008
51,Tangled,Animation,Disney,88.0,1.365692,89.0,355.08,2010
31,Midnight in Paris,Romance,Sony,84.0,8.744706,93.0,148.66,2011
4,Across the Universe,Romance,Independent,84.0,0.652603,54.0,29.367143,2007
35,My Week with Marilyn,Drama,The Weinstein Company,84.0,0.8258,83.0,8.258,2011
21,Knocked Up,Comedy,Universal,83.0,6.636402,91.0,219.001261,2007
61,Twilight,Romance,Summit,82.0,10.180027,49.0,376.661,2008
43,P.S. I Love You,Romance,Independent,82.0,5.103117,21.0,153.093505,2007
53,The Curious Case of Benjamin Button,Fantasy,Warner Bros.,81.0,1.783944,73.0,285.431,2008


In [37]:
fig = px.bar(audience_score_top_15, x='Film', y='Audience  score %', hover_data=['Year', 'Genre'], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- 'WALL-E' has the highest Rotten Tomatoes score (96) and shares the highest Audience score (89) with 'A Dangerous Method'.
- On the other hand, 'Twilight', 'P.S. I Love You' and 'The Twilight Saga: New Moon' have much lower Rotten Tomatoes scores than Audience scores.
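The difference between the two ratings can be turned into a number by subtracting one from the other. The sketch below does this for six films copied from the tables above; the column name 'Rating gap' is made up for this illustration.

```python
import pandas as pd

# Six films and their two ratings, copied from the top 15 tables above.
ratings = pd.DataFrame({
    'Film': ['Twilight', 'P.S. I Love You', 'The Twilight Saga: New Moon',
             'WALL-E', 'Tangled', 'Midnight in Paris'],
    'Audience  score %': [82.0, 82.0, 78.0, 89.0, 88.0, 84.0],
    'Rotten Tomatoes %': [49.0, 21.0, 27.0, 96.0, 89.0, 93.0],
})

# Positive gap: the audience liked the film more than the critics did.
ratings['Rating gap'] = ratings['Audience  score %'] - ratings['Rotten Tomatoes %']

ranked = ratings.sort_values('Rating gap', ascending=False)
print(ranked[['Film', 'Rating gap']])
```

In this small sample, the three films named above have the largest positive gaps, while 'WALL-E', 'Tangled' and 'Midnight in Paris' are rated slightly higher on Rotten Tomatoes than by the audience.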

- And finally, let's see the 15 most profitable movies.

#### Top 15 Most Profitable Movies

In [38]:
top_15_profit = df.sort_values('Profitability', ascending=False)[:15]
top_15_profit

Unnamed: 0,Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
8,Fireproof,Drama,Independent,51.0,66.934,40.0,33.467,2008
15,High School Musical 3: Senior Year,Comedy,Disney,76.0,22.913136,65.0,252.044501,2008
59,The Twilight Saga: New Moon,Drama,Summit,78.0,14.1964,27.0,709.82,2009
66,Waitress,Romance,Independent,67.0,11.089742,89.0,22.179483,2007
61,Twilight,Romance,Summit,82.0,10.180027,49.0,376.661,2008
29,Mamma Mia!,Comedy,Universal,76.0,9.234454,53.0,609.473955,2008
31,Midnight in Paris,Romance,Sony,84.0,8.744706,93.0,148.66,2011
1,(500) Days of Summer,Comedy,Fox,81.0,8.096,87.0,60.72,2009
57,The Proposal,Comedy,Disney,74.0,7.8675,43.0,314.7,2009
47,Sex and the City,Comedy,Warner Bros.,81.0,7.221796,49.0,415.253258,2008


In [39]:
fig = px.bar(top_15_profit, x='Film', y='Profitability', hover_data=['Year', 'Genre'], color='Lead Studio')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Interestingly (or maybe not), 'Fireproof', the most profitable movie in our dataset, scores below average in both ratings (Audience score 51, Rotten Tomatoes 40).

- Almost all of the lead studios have at least one movie in the top 15 most profitable list.
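How can 'Fireproof' be the most profitable film with a worldwide gross of only 33.467? This notebook does not define Profitability, but one plausible reading (an assumption, not something the dataset states here) is that it is Worldwide Gross divided by the budget. Under that assumption, dividing gross by profitability gives an implied budget. The column name 'Implied budget' is hypothetical, and the units are whatever units Worldwide Gross uses.

```python
import pandas as pd

# Four films with values copied from the tables in this notebook.
films = pd.DataFrame({
    'Film': ['Fireproof', 'The Twilight Saga: New Moon', 'WALL-E', '27 Dresses'],
    'Profitability': [66.934, 14.1964, 2.896019, 5.343622],
    'Worldwide Gross': [33.467, 709.82, 521.283432, 160.308654],
})

# Assumption: Profitability = Worldwide Gross / budget,
# so budget = Worldwide Gross / Profitability.
films['Implied budget'] = (films['Worldwide Gross'] / films['Profitability']).round(2)

print(films[['Film', 'Implied budget']])
```

If the assumption holds, 'Fireproof' tops the list because its implied budget is tiny compared with the other films, not because it earned the most.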

## If you like this one, you can also check out the other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London Bike Sharing EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best