# **Project Name**    - Amazon Prime TV Shows & Movies EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1** - Omkar Nitin Mohite

# **Project Summary -**

This analysis covers the TV shows and movies available on Amazon Prime Video. Two datasets are provided. The titles dataset holds the title, IMDb score, release year and related details for more than 9,000 pieces of content. The credits dataset describes the cast and contains only actors and directors. Both datasets had duplicate rows, which were removed. The titles dataset was then wrangled: data types were changed, null values were filled, and so on. The variables were then visualized, starting with univariate analysis, followed by bivariate and finally multivariate analysis. The charts surfaced several insights about the Amazon Prime catalogue, and the case study concludes by addressing the business objectives.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In today's competitive streaming industry, platforms like Amazon Prime Video are constantly expanding their content libraries to cater to diverse audiences. With a growing number of shows and movies available on the platform, data-driven insights play a crucial role in understanding trends, audience preferences, and content strategy.

This dataset was created to list all shows available on Amazon Prime Video and to analyze the data for interesting facts. It covers content available in the United States.

This dataset has 2 csv files and it is a mix of categorical and numeric values.

#### **Define Your Business Objective?**

This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

1. Content Diversity: What genres and categories dominate the platform?
2. Regional Availability: How does content distribution vary across different regions?
3. Trends Over Time: How has Amazon Prime's content library evolved?
4. IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

     By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import ast
!pip install pycountry
import pycountry as pyc

### Dataset Loading

In [None]:
# Load Dataset
titles_data = pd.read_csv("https://raw.githubusercontent.com/nischal-bellana/AmazonPrimeEDA/refs/heads/master/titles.csv")
credits_data = pd.read_csv("https://raw.githubusercontent.com/nischal-bellana/AmazonPrimeEDA/refs/heads/master/credits.csv")

### Dataset First View

In [None]:
# Dataset First Look
titles_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("titles shape:", titles_data.shape)
print("credits shape:", credits_data.shape)


### Dataset Information

In [None]:
# Dataset Info
titles_data.info()

In [None]:
credits_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("titles duplicate rows-", titles_data.duplicated().sum())
print("credits duplicate rows-", credits_data.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(titles_data.isnull().sum())
print(credits_data.isnull().sum())

In [None]:
# Visualizing the missing values
# (no matplotlib.axes import needed: plt.subplots below returns the axes array)
titles_null_count = titles_data.isnull().sum()
credits_null_count = credits_data.isnull().sum()

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

sns.barplot(x=titles_null_count.values, y=titles_null_count.index, ax=axes[0], hue=titles_null_count.index, palette='viridis')
sns.barplot(x=credits_null_count.values, y=credits_null_count.index, ax=axes[1], hue=credits_null_count.index, palette='viridis')

axes[0].set_title('titles')
axes[0].set_ylabel('Columns')
axes[0].set_xlabel('Null Count')
axes[1].set_title('credits')
axes[1].set_ylabel('Columns')
axes[1].set_xlabel('Null Count')

plt.tight_layout()
plt.subplots_adjust(wspace=0.4, hspace=0.6)
plt.show()

### What did you know about your dataset?

We have two datasets:

1. titles_data - lists all TV shows/movies on Amazon Prime Video.
The 'titles' dataset stores each TV show/movie's JustWatch ID in the column id.
It is the primary key of this table.
It has 9871 rows and 15 columns.
It has both numerical and categorical columns.
2. credits_data - lists the cast involved with the TV shows/movies.
The 'credits' dataset has no single column as primary key.
Each row is uniquely identified by the combination of person_id, id, name and character.
It has 124234 rows and 5 columns.
All columns are categorical except person_id, which is numerical.

     The two tables are related through 'id': it is a foreign key in credits that references the primary key 'id' in titles.

The 'titles' dataset has null values in 6 columns, with the 'seasons' column having the highest count. The dataset also has 3 duplicate rows that need to be removed. Some rows have null values in the IMDb and TMDB columns, which implies that not every movie/show has an IMDb ID, and not every title with an IMDb ID has an IMDb score or votes. The same goes for TMDB.

The 'credits' dataset has null values only in the 'character' column. Since directors are off screen and don't play a character, these nulls check out. The dataset also has 56 duplicate rows that need to be removed.
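The key relationship described above can be checked directly. The following is a minimal sketch on toy data, not the real tables: the rows are hypothetical and chosen so that one duplicate and one unmatched credit exist on purpose.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical rows, not the real data)
titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3", "tm2"],   # the "tm2" row appears twice on purpose
    "title": ["A", "B", "C", "B"],
})
credits = pd.DataFrame({
    "person_id": [10, 11, 12],
    "id": ["tm1", "ts3", "tm9"],          # "tm9" has no matching title
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Drop exact duplicate rows, then confirm 'id' is unique in titles
titles = titles.drop_duplicates()
id_is_unique = titles["id"].is_unique

# Credits rows whose 'id' does not appear in titles (broken foreign keys)
orphans = credits[~credits["id"].isin(titles["id"])]

print("id unique after dedup:", id_is_unique)
print("orphan credit rows:", len(orphans))
```

Running the same two checks on `titles_data` and `credits_data` after deduplication would confirm whether 'id' really behaves as a primary key and whether every credit points at a known title.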

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles's Columns\n", titles_data.columns)
print("Credits's Columns\n", credits_data.columns)

In [None]:
# Dataset Describe
titles_data.describe()

In [None]:
credits_data.describe()

### Variables Description

**Titles's Columns**

1. id - TV Show/Movie's ID in JustWatch website

2. title - Title of the show/movie

3. type - MOVIE/SHOW

4. description - Description of the show/movie

5. release_year - The Year it was released

6. age_certification - Age Certification given

7. runtime - Average episode runtime or Movie overall runtime (in minutes)

8. genres - Genre

9. production_countries - The country it was produced in

10. seasons - No of seasons if it is a show

11. imdb_id - IMDb ID

12. imdb_score - Score on IMDb

13. imdb_votes - Number of votes on IMDb

14. tmdb_popularity - Popularity on TMDB

15. tmdb_score - Score on TMDB

**Credits's Columns**

1. person_id - Unique ID of the Actor/Director

2. id - The show/movie the person is part of

3. name - Name of the person

4. character - Name of the character played, if the person is an actor

5. role - ACTOR/DIRECTOR

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
titles_data.nunique(axis=0)

In [None]:
credits_data.nunique(axis=0)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove duplicate rows from both datasets
titles_data.drop_duplicates(inplace=True)
credits_data.drop_duplicates(inplace=True)

print(titles_data.duplicated().sum())
print(credits_data.duplicated().sum())

In [None]:
# Changing dtypes of some columns
titles_data['seasons'] = titles_data['seasons'].fillna(0).astype(np.int64)
titles_data['imdb_votes'] = titles_data['imdb_votes'].fillna(0).astype(np.int64)
titles_data['genres'] = titles_data['genres'].apply(ast.literal_eval)
titles_data['production_countries'] = titles_data['production_countries'].apply(ast.literal_eval)
titles_data.head()

In [None]:
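# Converting country codes to country names
def toCountryNames(code_list):
  """Map ISO 3166-1 alpha-2 codes to country names, keeping unknown codes as-is."""
  country_names = []
  for code in code_list:
    code_match = pyc.countries.get(alpha_2=code)
    # Fall back to the raw code when pycountry has no match
    country_names.append(code_match.name if code_match else code)

  return country_names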

titles_data['production_countries'] = titles_data['production_countries'].apply(toCountryNames)
titles_data.head()


### What all manipulations have you done and insights you found?

The following manipulations were done:

1. Removed the duplicate rows from both datasets.
2. Changed the data types of seasons and imdb_votes in the titles dataset to integers, since both are counts, filling NAs with 0.
3. Converted the string literals in genres and production_countries into Python lists.
4. Converted the country codes in production_countries into country names.

   Both the seasons and imdb_votes columns had NA values. The seasons column has NAs for movies, which have no seasons, and some movies/TV shows have no IMDb votes. The NAs in both columns were filled with 0.

Now we are ready for analysis.
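Filling missing counts with 0 is simple, but it makes "no votes recorded" look the same as "zero votes", which shifts medians and interacts with log scales later on. As a hedged alternative, pandas' nullable `Int64` dtype keeps the gap. The sketch below uses made-up values, not the real column.

```python
import pandas as pd

# Hypothetical values; NaN marks a title without recorded votes
toy = pd.DataFrame({"imdb_votes": [1200.0, None, 35.0]})

# Option used in this notebook: fill with 0, then cast to a plain integer
filled = toy["imdb_votes"].fillna(0).astype("int64")

# Alternative: nullable integer dtype keeps the gap as <NA>
nullable = toy["imdb_votes"].astype("Int64")

print("median with zeros:", filled.median())
print("median ignoring missing:", nullable.median())
print("missing values kept:", nullable.isna().sum())
```

Which option is better depends on the chart: the zero-filled column is convenient for sizing markers, while the nullable column gives summary statistics that are not pulled down by placeholder zeros.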

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
fig, ax = plt.subplots(1, 2, figsize=(10, 8))

sns.histplot(titles_data['imdb_score'], bins=20, kde=True, ax=ax[0])
sns.histplot(titles_data['tmdb_score'], bins=20, kde=True, ax=ax[1])
plt.show()

##### 1. Why did you pick the specific chart?

Both scores are numeric, and it is insightful to view their distributions, so histograms were chosen.

##### 2. What is/are the insight(s) found from the chart?

**Peaks in the middle**: Both IMDb and TMDB scores peak around 6, so the most frequently given scores are around 6. This is a common pattern in rating data.

**Left skewed**: The distributions are slightly left skewed, indicating a tail of titles with low scores.
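The skew read off the histogram can also be quantified. The sketch below uses a small set of hypothetical scores, not the real columns, to show the two usual checks: the sign of the sample skewness and the position of the mean relative to the median.

```python
import pandas as pd

# Hypothetical scores with a tail of low values (not the real data)
scores = pd.Series([2.1, 4.8, 5.9, 6.0, 6.1, 6.3, 6.4, 6.6, 6.8, 7.0])

# Sample skewness: a negative value means a longer left tail
skew = scores.skew()

print(f"skewness: {skew:.2f}")
print("mean below median:", scores.mean() < scores.median())
```

Applying `titles_data['imdb_score'].skew()` and `titles_data['tmdb_score'].skew()` would show whether the real distributions are left skewed as the chart suggests.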

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A lot of the content is rated around 6. This type of content gets a consistent response from the audience, so future content acquisition can be based on it.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.histplot(titles_data['release_year'], bins=20, kde=True)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram lets us observe how much content was released in each period.

##### 2. What is/are the insight(s) found from the chart?

**Peak at the right end**: The peak at the right end shows the modern pace of content production. Given the huge popularity of entertainment content and technology that speeds up production, this result comes as no surprise.

**Slight rise around 1940**: With the mass production of televisions and the rising popularity of cinema, demand for film content could be expected to increase, which may explain this modest but noticeable rise.

**Sharp increase from the 2000s**: The growing popularity of the internet opened up opportunities for online streaming and for public engagement with content, which likely contributed to the rise in content production.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can see the likely impact of the internet on content. Social media could therefore be used to further improve audience engagement.

#### Chart - 3

In [None]:
#Splitting dataset titles based on SHOW/MOVIE type
movie_titles = titles_data[titles_data['type'] == 'MOVIE'].reset_index(drop=True)
show_titles = titles_data[titles_data['type'] == 'SHOW'].reset_index(drop=True)

#Removing Redundant columns in both datasets
movie_titles.drop(['type', 'seasons'], axis=1, inplace=True)
show_titles.drop('type', axis=1, inplace=True)

In [None]:
# Chart - 3 visualization code
fig, ax = plt.subplots(1, 2, figsize=(10, 8))

sns.histplot(movie_titles['runtime'], bins=20, kde=True, ax=ax[0])
sns.histplot(show_titles['runtime'], bins=20, kde=True, ax=ax[1])

ax[0].set_xlabel('Movie Runtime')
ax[1].set_xlabel('Show Runtime')

plt.show()



##### 1. Why did you pick the specific chart?

Runtime is a numeric column, so a histogram is a good way to visualize its frequency distribution.

##### 2. What is/are the insight(s) found from the chart?

**Movies** - The distribution is roughly bell-shaped with a long right tail. A few exceptional movies have runtimes as long as 500 minutes, but most movies run for around 100 minutes, which is typical.

**Shows** - The distribution is bimodal, with peaks at about 25 and 45 minutes. This suggests that TV shows fall into two groups by episode duration: short-form and long-form. There are also exceptional shows with 150-minute episodes. Shorter episodes may also be a financial strategy, as they can save production time and money.
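The short-form and long-form split can be made explicit by bucketing runtimes. This is a minimal sketch with hypothetical runtimes; the cut points at 35 and 70 minutes are assumptions chosen by eye, not values taken from the data.

```python
import pandas as pd

# Hypothetical episode runtimes in minutes (not the real data)
runtimes = pd.Series([22, 24, 25, 28, 43, 44, 46, 50, 58, 150], name="runtime")

# Assumed cut points: (0, 35], (35, 70], (70, inf)
bins = [0, 35, 70, float("inf")]
labels = ["short", "long", "very long"]
buckets = pd.cut(runtimes, bins=bins, labels=labels)

# Number of shows per bucket, in bucket order
counts = buckets.value_counts().sort_index()
print(counts)
```

The same call on `show_titles['runtime']` would give the share of shows in each group, which is easier to act on than the shape of the histogram alone.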

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Some viewers prefer shorter episodes to longer ones. Episode length could therefore be a factor in the personalized recommendations Amazon Prime gives each user.

Movie runtime could likewise be analysed to determine which runtimes leave a particular user most satisfied.

#### Chart - 4

In [None]:
# One row per (title, genre) pair, so scores can be grouped by genre

genres_separated = titles_data.loc[:, ['genres', 'title', 'imdb_id', 'imdb_score', 'tmdb_score', 'imdb_votes', 'tmdb_popularity']]

genres_separated = genres_separated.explode('genres')

genres_separated


In [None]:
# Chart - 4 visualization code
genres_separated_imdb = genres_separated[genres_separated['imdb_id'].notna()]

order = genres_separated_imdb.groupby('genres')['imdb_score'].median().sort_values(ascending=False)
order = order.index

fig = px.box(
    data_frame=genres_separated_imdb,
    x='genres',
    y='imdb_score',
    color='genres',
    title='Genres vs IMDB Score',
    category_orders={'genres': order}
)

fig.show()

##### 1. Why did you pick the specific chart?

Genres are categorical and imdb_score is continuous. A box plot lets us view the imdb_score distribution across genres.

##### 2. What is/are the insight(s) found from the chart?

**Top Median Rating Genres**: The genres with the highest medians are Documentation, Reality and History. These genres are consistently liked by viewers.

**Highest Rating Genres**: Genres like Drama, History and Action have outliers rated 9.9, the highest in the whole dataset. Drama has more than one outlier at around 9.7. Action has a median on the lower end, which suggests that audience reaction to the genre varies heavily, from a highest rating of 9.9 to a lowest of 1.5.

**Lowest Median Rating Genres**: The genres with the lowest medians are Horror, Sci-Fi and Thriller. This could be due to rushed production schedules and poor CGI.
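The genre ranking behind the box plot can be reproduced as a small table. The sketch below runs on hypothetical titles and adds a title count next to each median, since a median built on very few titles is less trustworthy. Note that a title with several genres is counted once per genre.

```python
import pandas as pd

# Hypothetical titles; 'genres' already holds Python lists
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["drama", "history"], ["horror"], ["drama"], ["horror", "thriller"]],
    "imdb_score": [8.0, 4.0, 7.0, 5.0],
})

# One row per (title, genre) pair, then median score and title count per genre
per_genre = (
    toy.explode("genres")
       .groupby("genres")["imdb_score"]
       .agg(median="median", n_titles="count")
       .sort_values("median", ascending=False)
)
print(per_genre)
```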

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since the Documentation, History and Reality genres have higher median ratings, Amazon Prime can focus on acquiring rights to shows featuring these genres.

Genres such as Horror, Sci-Fi and Thriller should be treated with caution, and their titles deserve more scrutiny before streaming rights are acquired.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
order = genres_separated.groupby('genres')['tmdb_score'].median().sort_values(ascending=False)
order = order.index

fig = px.box(
    data_frame=genres_separated,
    x='genres',
    y='tmdb_score',
    color='genres',
    title='Genres vs TMDB Score',
    category_orders={'genres': order}
)

fig.show()


##### 1. Why did you pick the specific chart?

Genres are categorical and tmdb score is continuous. Choosing Boxplot enables us to view tmdb_score distribution across different genres.

##### 2. What is/are the insight(s) found from the chart?

**Top Median Ratings**: Consistent with the IMDb scores, Reality, Documentation, History and also Animation come out on top by median.

**Highest Ratings**: All genres have outliers that reach a score of 10. This could be a sign of fewer voters on TMDB, since a title with few votes can reach a rating of 10 easily.

**Lowest Median Ratings**: Again consistent with the IMDb scores, Horror, Sci-Fi and Thriller have the lowest medians.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The data shows the same insights as the IMDb scores. But it must be noted that TMDB scores might be less reliable, since there are likely fewer voters and the rating process may be less strict.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
order = genres_separated_imdb.groupby('genres')['imdb_votes'].median().sort_values(ascending=False)
order = order.index

fig = px.box(
    data_frame=genres_separated_imdb,
    x='genres',
    y='imdb_votes',
    color='genres',
    title='Genres vs IMDB Votes',
    category_orders={'genres': order},
    log_y=True
)

fig.show()


##### 1. Why did you pick the specific chart?

Genres are categorical and imdb_votes is continuous. Choosing Boxplot enables us to view imdb_votes distribution across different genres.

##### 2. What is/are the insight(s) found from the chart?

**Low Ratings, More Popularity**: Although the Sci-Fi, Horror and Thriller genres have the lowest median ratings, they have high median vote counts.

**High Ratings, Less Popularity**: In contrast, Documentation, which had a high median rating on both IMDb and TMDB, has the second-lowest median vote count.

**High Votes Outliers**: Every genre has outliers with extremely large vote counts. This tells us that, whatever the genre, there are always titles that are famous worldwide. Every genre also spans a wide range, reinforcing the idea that all genres have titles at every level of popularity.
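The inverse relationship between rating and popularity noted above can be summarised with a rank correlation across genres. The sketch below uses invented per-genre medians purely for illustration. It computes Spearman's rho as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Hypothetical per-genre medians (not the real figures)
genre_stats = pd.DataFrame(
    {
        "median_score": [7.4, 7.0, 6.3, 5.6, 5.2],
        "median_votes": [300, 450, 900, 1500, 2100],
    },
    index=["documentation", "history", "drama", "thriller", "horror"],
)

# Spearman correlation = Pearson correlation of the ranks
rho = genre_stats["median_score"].rank().corr(genre_stats["median_votes"].rank())
print(f"rank correlation between score and votes: {rho:.2f}")
```

A value close to -1 on the real per-genre medians would support the "low ratings, more popularity" reading, while a value near 0 would mean the pattern only holds for a few genres.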

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The moderately popular genres, like Romance and Sport, are of interest to us since they also have moderate ratings.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
order = genres_separated.groupby('genres')['tmdb_popularity'].median().sort_values(ascending=False)
order = order.index

fig = px.box(
    data_frame=genres_separated,
    x='genres',
    y='tmdb_popularity',
    color='genres',
    title='Genres vs TMDB Popularity',
    category_orders={'genres': order},
    log_y=True
)

fig.show()

##### 1. Why did you pick the specific chart?

Genres are categorical and tmdb_popularity is continuous. Choosing Boxplot enables us to view tmdb_popularity distribution across different genres.

##### 2. What is/are the insight(s) found from the chart?

**High Popularity, Low Rating**: The insights match those from the IMDb votes. Lower-rated genres like Horror, Sci-Fi and Thriller are quite popular.

**Low Popularity, High Rating**: Documentation is again less popular but highly rated.

**Large Popularity Outliers**: Every genre has at least a few globally popular shows/movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Same insights as for the IMDb votes.

#### Chart - 8

In [None]:
# One row per (title, production country) pair, so scores can be grouped by country

countries_separated = titles_data.loc[:, ['production_countries', 'title', 'imdb_id', 'imdb_score', 'tmdb_score', 'imdb_votes', 'tmdb_popularity']]

countries_separated = countries_separated.explode('production_countries')

countries_separated

In [None]:
# Chart - 8 visualization code
countries_separated_imdb = countries_separated[countries_separated['imdb_id'].notna()]

order = countries_separated_imdb.groupby('production_countries')['imdb_score'].median().sort_values(ascending=False)
order = order.index

fig = px.box(
    data_frame=countries_separated_imdb,
    x='production_countries',
    y='imdb_score',
    color='production_countries',
    title='Production Countries vs IMDB Score',
    category_orders={'production_countries': order},
)

fig.show()

##### 1. Why did you pick the specific chart?

The production_countries column is categorical, and a box plot is useful for observing the distribution of imdb_score across it.

##### 2. What is/are the insight(s) found from the chart?

**Highest Score**: The highest IMDb score belongs to a title from India. This could be due to classic films that are loved by a large part of the Indian audience.

**Huge Range**: The wide range of scores for content from countries like India, the United States and Canada could imply a huge number of voters and significant engagement with rating films. Content from the US is viewed worldwide, so its voter base is naturally large.

**High Median Scores**: Countries like French Polynesia, Albania and Lebanon have the highest median scores. They also have a narrow range, which could be due to fewer voters from those countries.

**Diverse Content**: The content is highly diverse, with around 116 countries having some content on Amazon Prime. Some countries, such as Bermuda and Kazakhstan, don't have much content.
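Countries with only a handful of titles can top a median ranking by chance. One possible guard, sketched below on hypothetical rows, is to keep only countries with a minimum number of titles before ordering the box plot. The threshold `MIN_TITLES` is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Hypothetical exploded rows: one row per (title, country) pair
rows = pd.DataFrame({
    "production_countries": ["India"] * 4 + ["United States"] * 5 + ["Albania"],
    "imdb_score": [9.0, 3.5, 6.5, 7.0, 2.0, 5.5, 6.0, 7.5, 8.5, 8.2],
})

MIN_TITLES = 3  # arbitrary threshold chosen for illustration

# Keep only the rows of countries that have at least MIN_TITLES titles
kept = rows.groupby("production_countries").filter(lambda g: len(g) >= MIN_TITLES)

# Order the remaining countries by median score, highest first
order = (
    kept.groupby("production_countries")["imdb_score"]
        .median()
        .sort_values(ascending=False)
        .index.tolist()
)
print(order)
```

The filtered frame and the `order` list can be passed to `px.box` in the same way as in the chart code, giving a plot that is less dominated by single-title countries.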

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Further analysis of the specific categories of films/shows liked in wide-range countries like India and the US could be beneficial. Increasing the amount of content from under-represented countries should also be a target.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
order = countries_separated.groupby('production_countries')['tmdb_score'].median().sort_values(ascending=False)
order = order.index

fig = px.box(
    data_frame=countries_separated,
    x='production_countries',
    y='tmdb_score',
    color='production_countries',
    title='Production Countries vs TMDB Score',
    category_orders={'production_countries': order},
)

fig.show()


##### 1. Why did you pick the specific chart?

Production Countries column is categorical and box plot would be useful to observe distribution of tmdb score.

##### 2. What is/are the insight(s) found from the chart?

**Huge Range Countries**: Similar to the IMDb scores, India, the US and Canada have a wide range of scores. This reflects the worldwide audience for US and Canadian content and the large base of regional-content fans in India who take part in rating.

**More Outliers**: As noted before, TMDB likely has substantially fewer voters, so there are more outliers, with scores reaching as high as 10 and as low as 0.8.

**Highest Medians**: Content from Pakistan has a median of 8.6, and its range is not too small. This may indicate that viewers of Pakistani content use TMDB as a rating platform more than IMDb.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

TMDB scores seem to be calculated less strictly, and since the platform likely has fewer voters, the data should be scrutinized carefully. But some countries have a decent range of TMDB scores, which can be used for analysis in that region alone.

High-range countries should be easy targets for deeper analysis.

#### Chart - 10

In [None]:
# Creating a dataset with non-null imdb_id and imdb_score
titles_imdb = titles_data.dropna(subset=['imdb_id', 'imdb_score'])

In [None]:
# Chart - 10 visualization code
fig, ax = plt.subplots(2, 1, figsize=(10, 8))

sns.lineplot(data=titles_imdb, x='release_year', y='imdb_score', ax=ax[0])
sns.lineplot(data=titles_imdb, x='release_year', y='imdb_votes', ax=ax[1])

plt.show()

##### 1. Why did you pick the specific chart?

Since we want to analyse trends in numeric columns over time, line plots are the best fit.

##### 2. What is/are the insight(s) found from the chart?

**IMDB Score**: No strong pattern is observed here. The value zigzags rapidly, which is common for year-by-year averages. The shaded band is narrow near the 1920s, which may reflect how little content there is from that period, and wide over 1940-2020, which may reflect more content and more viewership of that content. There is a large peak around 1926-28 and a big drop in the 1930s. This could be related to content from the period between the wars and the start of World War 2.

**IMDB Votes**: There is a gradual increase in the number of votes. There seem to be peaks around 1918, 1930 and 1945, which could be related to content about historical milestones. The variance is huge in the 2000s: the number of votes is large, and some popular titles have unusually high vote counts.
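By default, seaborn's `lineplot` draws the mean of all values that share an x value and shades a 95% confidence interval around it. The width of that band therefore depends on both the spread and the number of titles in each year. A quick way to interpret it is to tabulate the count, mean and standard deviation per year, as in this sketch on hypothetical titles.

```python
import pandas as pd

# Hypothetical titles (not the real data)
toy = pd.DataFrame({
    "release_year": [1926, 1935, 1935, 2019, 2019, 2019, 2019],
    "imdb_score":   [8.1, 5.0, 6.0, 4.0, 6.5, 7.0, 9.0],
})

# Number of titles, mean score and spread for each year
yearly = toy.groupby("release_year")["imdb_score"].agg(["count", "mean", "std"])
print(yearly)
```

A year with a single title has no spread to estimate, so a peak in such a year reflects one title rather than a trend. Running the same aggregation on `titles_imdb` would show which of the early peaks rest on only a few titles.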

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There is potential to further analyse score and popularity patterns to predict what type of content causes them.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
fig, ax = plt.subplots(2, 1, figsize=(10, 8))

sns.lineplot(data=titles_data, x='release_year', y='tmdb_score', ax=ax[0])
sns.lineplot(data=titles_data, x='release_year', y='tmdb_popularity', ax=ax[1])

plt.show()

##### 1. Why did you pick the specific chart?

Since we want to analyse trends in numeric columns over time, line plots are the best fit.

##### 2. What is/are the insight(s) found from the chart?

**TMDB Score**: The pattern is similar to the IMDb score plot, with similar peaks in the 1920s and 30s. Variation increases from the 1940s and continues to the present.

**TMDB Popularity**: The graph is flat, with a sharp increase in the 2020s. This could be an effect of the post-COVID era.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Same insights as with IMDb.

#### Chart - 12

In [None]:
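# Chart - 12 visualization code
# Copy the filtered frame so the assignment below does not modify a view of show_titles
df = show_titles[show_titles['imdb_id'].notna()].copy()
# Marker size: integer part of log10(1 + votes), i.e. votes bucketed by order of magnitude
df['imdb_votes'] = np.log10(1 + df['imdb_votes']).astype(np.int64)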

fig = px.scatter(
    data_frame=df,
    x='runtime',
    y='imdb_score',
    color='age_certification',
    size='imdb_votes',
    hover_name='title',
    title='Runtime, Votes, Age Certification vs IMDB Score (Shows)',
)

fig.show()

##### 1. Why did you pick the specific chart?

A scatter plot is well suited to checking for correlation between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

**No Correlation**: There seems to be no significant correlation between runtime and IMDb score.

**Highest Rated Show**: The highest rated show is "The Chosen", which has a rating of 9.4.

**Clusters**: There are two clusters in the graph, at runtimes of 20-30 minutes and 40-60 minutes. This reflects the common pattern of standard long episodes and short episodes.

**Exceptional runtimes**: There are also shows like "Endeavors" and "Generation War" with runtimes around 90 minutes. Their scores are also similar, at around 8.5.

**Diverse age certifications**: The age certifications of the shows are diverse, covering all certifications. But the most popular shows are certified TV-MA or TV-14.
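The "no correlation" reading from the scatter plot can be backed by a number. The sketch below computes both a Pearson coefficient (linear association) and a Spearman coefficient (rank-based association) on a few hypothetical shows. The values are invented and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical shows (not the real data)
shows = pd.DataFrame({
    "runtime":    [22, 25, 30, 42, 45, 58, 60, 90],
    "imdb_score": [7.1, 6.0, 8.2, 6.5, 7.8, 5.9, 8.0, 7.0],
})

# Linear association between the two columns
pearson = shows["runtime"].corr(shows["imdb_score"])

# Rank-based association, taken from the 2x2 correlation matrix
spearman = shows[["runtime", "imdb_score"]].corr(method="spearman").iloc[0, 1]

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

Coefficients close to 0 on the real `show_titles` and `movie_titles` frames would confirm the visual impression, while a clearly non-zero Spearman value would hint at a monotonic but non-linear relationship.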

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The popular runtime ranges are worth noting. Further analysis of how runtimes vary across seasons could lead to more insights.

#### Chart - 13

In [None]:
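# Chart - 13 visualization code
# Copy the filtered frame so the assignment below does not modify a view of movie_titles
df = movie_titles[movie_titles['imdb_id'].notna()].copy()
# Marker size: integer part of the natural log of (1 + votes)
df['imdb_votes'] = np.log(1 + df['imdb_votes']).astype(np.int64)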


fig = px.scatter(
    data_frame=df,
    x='runtime',
    y='imdb_score',
    color='age_certification',
    size='imdb_votes',
    hover_name='title',
    title='Runtime, Votes, Age Certification vs IMDB Score (Movies)'
)

fig.show()


##### 1. Why did you pick the specific chart?

Scatter plot is best for understanding correlation between numericals.

##### 2. What is/are the insight(s) found from the chart?

**No Correlation**: There seems to be no particular correlation between runtime and imdb_score.

**Longest Movie**: "The Greatest Story Ever Told"

**Highest rated Movie**: "Quota"

**Clustering**: Most movies have runtimes in the range of 70-160 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No particularly interesting insights.

#### Chart - 14 - Directors vs Average IMDb Score

In [None]:
# Join titles that have an IMDb score with their directors, then average the score per director
directors_data = pd.merge(titles_imdb.loc[:, ['id', 'title', 'imdb_score']], credits_data.loc[credits_data['role'] == 'DIRECTOR', ['id', 'name', 'role']], on='id', how='inner')
directors_data = directors_data.groupby('name')['imdb_score'].mean().sort_values(ascending=False)

# Keep only the extremes: average score above 9 or below 2
directors_data = directors_data[(directors_data > 9) | (directors_data < 2)]

directors_data


In [None]:
# Chart - 14 visualization code
fig = px.bar(
    data_frame=directors_data
    .reset_index()
    .rename(columns={'name': 'Director', 'imdb_score': 'Average Score'}),
    x='Director',
    y='Average Score',
    title='Directors vs Average Scores'
)

fig.show()

##### 1. Why did you pick the specific chart?

Since this is a categorical vs numerical comparison, a bar graph is the best fit.

##### 2. What is/are the insight(s) found from the chart?

**Top Directors**: The top directors are Digpal Lanjekar (9.9), T S Suresh Babu (9.3), T. J. Gnanavel (9.3), etc.

**Bottom Directors**: The bottom directors are Armen Adilkhanyan (1.5), Darren Doane (1.3), Jason Wright (1.1), etc.
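An average built on a single title says more about that title than about the director. As a hedged refinement, the sketch below reports the number of titles next to each average and ranks only directors above a minimum count. The names, scores and the `MIN_TITLES` cut-off are all hypothetical.

```python
import pandas as pd

# Hypothetical merged rows: one row per (title, director) pair
merged = pd.DataFrame({
    "name": ["Dir A", "Dir B", "Dir B", "Dir B", "Dir C", "Dir C"],
    "imdb_score": [9.9, 8.0, 7.5, 8.5, 6.0, 7.0],
})

MIN_TITLES = 2  # arbitrary cut-off for illustration

# Average score and number of titles per director
summary = merged.groupby("name")["imdb_score"].agg(avg_score="mean", n_titles="count")

# Rank only directors with enough titles behind their average
ranked = summary[summary["n_titles"] >= MIN_TITLES].sort_values("avg_score", ascending=False)
print(ranked)
```

In this toy example the director with one 9.9 title drops out of the ranking, which is the effect to look for when applying the same idea to the merged directors or actors data.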

#### Chart - 15 - Actors vs Average IMDb Score

In [None]:
# Join titles that have an IMDb score with their actors, then average the score per actor
actors_data = pd.merge(titles_imdb.loc[:, ['id', 'title', 'imdb_score']], credits_data.loc[credits_data['role'] == 'ACTOR', ['id', 'name', 'role']], on='id', how='inner')
actors_data = actors_data.groupby('name')['imdb_score'].mean().sort_values(ascending=False)

# Keep only the extremes: average score above 9.3 or below 1.5
actors_data = actors_data[(actors_data > 9.3) | (actors_data < 1.5)]

actors_data

In [None]:
# Chart - 15 visualization code
fig = px.bar(
    data_frame=actors_data
    .reset_index()
    .rename(columns={'name': 'Actor', 'imdb_score': 'Average Score'}),
    x='Actor',
    y='Average Score',
    title='Actors vs Average Scores'
)

fig.show()

##### 1. Why did you pick the specific chart?

Since this is a categorical vs numerical comparison, a bar graph is the best fit.

##### 2. What is/are the insight(s) found from the chart?

**Top actors**: The top actors are Surabhi Bhave (9.9), Chinmay Mandlekar (9.9), Ankit Mohan (9.9), etc.

**Bottom actors**: The bottom actors are Lizzy Carter-Burns (1.1), Darren Richardson (1.1), Tina Shuster (1.1), etc.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis and visualizations above, the following suggestions could help Amazon Prime improve its service:

1. Since content is diverse across countries, analysing popular and well-scored content could help in deciding on future content for those regions.
2. Apply more scrutiny to popular but lower-rated genres like Sci-Fi, Horror and Thriller. A minimum bar on the production resources used could be set.
3. Increase the amount of content in well-rated genres like Documentation and History.
4. Research further how episode and movie runtimes can be optimized for better storytelling.
5. Machine learning could be applied to predict scores based on the patterns observed in release year vs IMDb score.

# **Conclusion**

This analysis only scratches the surface, and there is potential for much more. Some other ideas that come to mind are: average IMDb scores for every unique word used in descriptions or titles, frequency analysis of the words in the descriptions and titles of top-rated content, genres that are popular in shows vs genres in movies, length of the title vs IMDb score, etc. We could also apply machine learning and deep learning techniques given more data, such as seasonally popular genres, thumbnail content vs rating or popularity, the keywords users search for most, etc.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***