<a href="https://colab.research.google.com/github/harshaguchhait12/EDA-Project-Amazon-Prime-Movies-TV-Shows/blob/main/Another_copy_of_EDATemplate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -
EDA Project: Amazon Prime Movies & TV Shows



# **Project Summary -**

This project focuses on performing Exploratory Data Analysis (EDA) on a dataset containing information about movies and TV shows available on Amazon Prime Video. The main objective of the project is to understand the structure, patterns, and trends within the content library of Amazon Prime.

The analysis begins by loading and inspecting the dataset to understand its size, data types, and completeness. Missing values, duplicates, and inconsistencies are identified and handled appropriately to improve data quality. Descriptive statistics are generated to summarize key features such as release years, duration, genres, and content ratings. Visualizations, including histograms, bar charts, boxplots, and correlation heatmaps, help uncover trends and relationships in the data.

The project also explores content distribution across different categories, such as the number of movies vs. TV shows, most common genres, release year trends, and regional characteristics based on the country of origin. Outlier detection is performed on numerical columns to identify unusual values.

Overall, this EDA project provides meaningful insights into the composition and patterns of Amazon Prime’s content library. The findings can be used for business decisions such as recommending content, understanding user preferences, improving categorization, and supporting data-driven strategies for streaming platforms.

# **GitHub Link -**

Another_copy_of_EDATemplate.ipynb




**Problem Statement**

This project performs Exploratory Data Analysis (EDA) on the Amazon Prime Movies and TV Shows dataset. The goal is to identify patterns, detect inconsistencies, analyze content distribution, and extract meaningful insights. The analysis will help answer key questions such as:

- What types of content dominate Amazon Prime (movies vs. TV shows)?
- Which genres are most popular?
- How has the release of content changed over the years?
- Which countries contribute the most content?
- Are there missing values, duplicates, or outliers that affect data quality?



#### **Define Your Business Objective?**

To analyze Amazon Prime’s content library using EDA in order to understand trends, improve content strategy, enhance user engagement, and support data-driven business decisions.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
data1 = pd.read_csv('/content/credits.csv')

In [None]:
data2 = pd.read_csv('/content/titles.csv')

### Dataset First View

In [None]:
# Dataset First Look
display(data1)

In [None]:
display(data2)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = data1.shape
print(" Total Rows :", rows)
print(" Total Columns :", columns)

In [None]:
rows, columns = data2.shape
print(" Total Rows :", rows)
print(" Total Columns :", columns)

### Dataset Information

In [None]:
# Dataset Info
data1.info()

In [None]:
data2.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = data1.duplicated().sum()
print(" Total Duplicate Rows:", duplicate_count)

In [None]:
data1.drop_duplicates( inplace = True)

In [None]:
data1.duplicated().sum()

In [None]:
data1.dtypes

In [None]:
data1.describe()

In [None]:
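# Show any duplicate rows that remain (empty once duplicates have been dropped)
duplicate_rows = data1[data1.duplicated()]
print("\n Duplicate Rows:")
display(duplicate_rows)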

In [None]:
data1_no_duplicates = data1.drop_duplicates()

print(" Rows after removing duplicates:", data1_no_duplicates.shape[0])


In [None]:
#2nd
duplicate_count = data2.duplicated().sum()
print(" Total Duplicate Rows:", duplicate_count)

In [None]:
# Show the duplicate rows themselves, not just the label
duplicate_rows = data2[data2.duplicated()]
print("\n Duplicate Rows:")
display(duplicate_rows)

In [None]:
data2_no_duplicates = data2.drop_duplicates()

print(" Rows after removing duplicates:", data2_no_duplicates.shape[0])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count for data 1


# Column-wise missing values
print(" Missing / Null Values in Each Column:")
print(data1.isnull().sum())

In [None]:
print(" Total Missing Values in Dataset:", data1.isnull().sum().sum())
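Raw null counts are easier to act on when shown next to the share of rows they affect. The sketch below is a minimal illustration of that idea; the `toy` frame and the `missing_summary` helper are hypothetical stand-ins, not part of the original notebook, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, np.nan, 5.4, np.nan],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage per column, worst first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data1)` or `missing_summary(data2)` in the notebook would give the same view for the real data, which can help decide between dropping rows and imputing values.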


In [None]:
# Drop rows with null values in the character column
data1.dropna(subset= ['character'],inplace = True)

In [None]:
data1.isnull().sum()

In [None]:
# Missing Values/Null Values Count for data 2


# Column-wise missing values
print(" Missing / Null Values in Each Column:")
print(data2.isnull().sum())

In [None]:
data2.dropna(subset= ['description'],inplace = True)

In [None]:
data2['age_certification'].mode()[0]

In [None]:
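# Assign the result back; inplace fillna on a selected column is deprecated in recent pandas
data2['age_certification'] = data2['age_certification'].fillna(data2['age_certification'].mode()[0])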

In [None]:
data2.isnull().sum()

In [None]:
# Missing season counts are set to 0
data2['seasons'] = data2['seasons'].fillna(0)

In [None]:
data2.isnull().sum()

In [None]:
data2.dropna(subset= ['imdb_id'],inplace = True)

In [None]:
data2.isnull().sum()

In [None]:
round(data2['imdb_score'].mean(),1)

In [None]:
# Fill missing IMDb scores with the rounded column mean
data2['imdb_score'] = data2['imdb_score'].fillna(round(data2['imdb_score'].mean(), 1))

In [None]:
data2.shape

In [None]:
# Treat missing vote counts as 0 votes
data2['imdb_votes'] = data2['imdb_votes'].fillna(0)

In [None]:
data2.isnull().sum()

In [None]:
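# Start from the cleaned data2; data2_no_duplicates was created before the cleaning steps above
data2 = data2.drop_duplicates().dropna(subset=['tmdb_popularity', 'tmdb_score'])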

print("Remaining rows:", data2.shape[0])

In [None]:
data2.shape

### What did you know about your dataset?

The project uses two files. `credits.csv` (loaded as `data1`) holds credit records, including a `character` column. `titles.csv` (loaded as `data2`) describes movies and TV shows available on Amazon Prime, with columns such as `type`, `description`, `release_year`, `age_certification`, `runtime`, `genres`, `production_countries`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score`. Together they support analysis of content distribution, quality trends, audience ratings, and production patterns across different years and categories.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(data1.columns.tolist())
print(data2.columns.tolist())

In [None]:
# Dataset Describe
data2.describe(include='all')

### Variables Description

The table covers the columns that this notebook's code refers to.

| **Variable Name**        | **Description**                                                                     |
| ------------------------ | ----------------------------------------------------------------------------------- |
| **type**                 | Indicates whether the title is a movie or a TV show.                                |
| **title**                | Name of the movie or TV show.                                                       |
| **description**          | Short summary describing the plot or theme of the title.                            |
| **release_year**         | The year the movie/show was originally released.                                    |
| **age_certification**    | Content rating (e.g., TV-MA, PG-13, R). Indicates audience suitability.             |
| **runtime**              | Runtime of the title, in minutes.                                                   |
| **seasons**              | Number of seasons for TV shows. Missing values are filled with 0 during cleaning.   |
| **genres**               | Genre/category of the movie or TV show (e.g., Drama, Comedy, Action).               |
| **production_countries** | Country or countries where the movie/show was produced.                             |
| **imdb_id**              | IMDb identifier of the title.                                                       |
| **imdb_score**           | Rating of the title on IMDb.                                                        |
| **imdb_votes**           | Number of votes the title received on IMDb.                                         |
| **tmdb_popularity**      | Popularity score of the title as per The Movie Database (TMDb).                     |
| **tmdb_score**           | User rating/score on TMDb (typically 1–10).                                         |
| **character**            | Character name in a credit record (from `credits.csv`).                             |

### Check Unique Values for each variable.
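One way to check unique values is `DataFrame.nunique()` for the counts, plus a short preview of the distinct values in each column. This is a minimal sketch on a made-up `toy` frame; the column values shown (for example `"MOVIE"` and `"SHOW"`) are assumptions for illustration and may not match the real data.

```python
import pandas as pd

# Toy stand-in for the real data; values are assumptions for illustration.
toy = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "release_year": [1999, 2015, 2015, 2020],
    "age_certification": ["R", "TV-MA", "R", None],
})

# Number of distinct non-null values per column
unique_counts = toy.nunique()
print(unique_counts)

# Preview up to 10 distinct values per column
for column in toy.columns:
    print(column, "->", toy[column].dropna().unique()[:10])
```

In the notebook, replacing `toy` with `data2` (or `data1`) would give the same overview for the real columns.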

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# box plot of imdb_score
plt.figure(figsize=(8, 6))
sns.boxplot(data=data2, y='imdb_score')
plt.title('Box Plot of imdb_score\n', color='brown')
plt.show()

In [None]:
# box plot of runtime
plt.figure(figsize=(8, 6))
sns.boxplot(data=data2, x='runtime')
plt.title('Box Plot of runtime\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot because it is the best chart to detect outliers and understand the distribution, spread, and variability of numerical data. It gives a quick statistical summary and helps in identifying extreme values that may impact analysis.
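The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule for box plots; the `iqr_outliers` helper and the sample scores are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]

# Made-up scores: most cluster around 6 to 7, with one very low and one very high value
scores = pd.Series([6.0, 6.5, 7.0, 6.8, 6.2, 7.1, 1.2, 9.9])
outliers = iqr_outliers(scores)
print(outliers)
```

Applied as `iqr_outliers(data2['imdb_score'])` or `iqr_outliers(data2['runtime'])`, this would return the rows behind the outlier markers, so they can be inspected rather than only seen.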


##### 2. What is/are the insight(s) found from the chart?

The box plot of IMDB scores shows that most titles fall between 5 and 8, indicating average-quality content. The presence of outliers shows that a few titles are extremely highly or poorly rated. Overall, the distribution is slightly skewed with only a small number of high-rated titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help create a positive business impact by identifying high-performing content, improving recommendations, and guiding content strategy. However, the dominance of average scores and the presence of low-rated outliers may lead to negative growth, as inconsistent content quality can reduce user satisfaction and long-term engagement.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Histogram of release year
plt.figure(figsize=(8,6))
sns.histplot(data=data2, x ='release_year', bins = 20, color = 'blue', kde = True)
plt.title('Histogram of release year\n', color='brown')
plt.show()


In [None]:
# Histogram run time
plt.figure(figsize=(8,6))
sns.histplot(data=data2, x ='runtime', bins = 25, color = 'red', kde = True)
plt.title('Histogram of run time\n', color='brown')
plt.show()

In [None]:
#histogram of seasons
plt.figure(figsize=(8,6))
sns.histplot(data=data2, x ='seasons', bins = 20, color = 'green', kde = True)
plt.title('Histogram of seasons\n', color='brown')
plt.show()

In [None]:
#histogram of imdb_score
plt.figure(figsize=(8,6))
sns.histplot(data=data2, x ='imdb_score', bins = 30, color = 'orange', kde = True)
plt.title('Histogram of imdb_score\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I selected the histogram because it clearly shows the frequency distribution of numerical data. It helps identify skewness, common value ranges, and patterns in popularity and scores that a box plot alone cannot show. This makes it ideal for understanding how the data is spread across different bins.

##### 2. What is/are the insight(s) found from the chart?

The histogram of IMDB scores shows that most titles have scores between 5 and 8, indicating moderate to good audience reception. Very few titles have extremely high or low scores. This suggests that the overall content quality on Amazon Prime is balanced but not exceptional.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the IMDB score insights help improve content strategy, recommendations, and user engagement, leading to positive business impact. However, the dominance of average-scoring titles and the low number of high-rated titles may lead to negative growth, as users might prefer platforms with consistently higher-quality content.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#bar plot/ count plot
a= data2.type.value_counts()
print(a)

In [None]:
#bar plot of type
plt.figure(figsize=(8,6))
sns.barplot(x=a.index, y=a.values, width=0.5, color='blue' ,edgecolor = 'black')
plt.title('Bar Plot of type(movie/tv show)\n', color='brown')
plt.ylabel('count')
plt.show()


In [None]:
# Bar plot of age_certification
b = data2.age_certification.value_counts()
print(b)

In [None]:
#bar plot age certification
plt.figure(figsize=(8,6))
sns.barplot(x=b.index, y=b.values, color='green' ,edgecolor = 'black')
plt.title('Bar Plot of age_certification\n', color='brown')
plt.ylabel('count')
plt.show()

In [None]:
genre_counts = data2['genres'].astype(str).value_counts()
c = genre_counts[genre_counts > 100]

plt.figure(figsize=(8,6))
sns.barplot(y=c.index, x=c.values, color='red', edgecolor='black', orient='h')
plt.title('Bar Plot of Genres', color='brown')
plt.xlabel('Count')
plt.show()

In [None]:

# Count titles per production country
prod_countries= data2['production_countries'].value_counts()
countries = prod_countries[prod_countries > 100]
countries


In [None]:
# Bar plot of production countries
plt.figure(figsize=(8,6))
sns.barplot(x=countries.values, y=countries.index, color='orange', edgecolor='black', orient='h')
plt.title('Bar Plot of Production Countries', color='brown')
plt.xlabel('Count')
plt.show()

In [None]:
# Average IMDb score for each age certification
certification= data2.groupby('age_certification')['imdb_score'].mean().sort_values(ascending=False)
certification


In [None]:
# Bar plot of average rating for each age certification
plt.figure(figsize=(8,6))
sns.barplot(x=certification.values, y=certification.index, color='purple', edgecolor='black', orient='h')
plt.title('Bar Plot of Average Rating for Each Age Certification', color='brown')
plt.xlabel('Average Rating')
plt.show()

##### 1. Why did you pick the specific chart?

I selected the bar plot because it is the best chart to visualize and compare the frequency of categorical variables such as genres, ratings, and content types. It helps identify the most and least common categories, making it useful for understanding user preferences and platform content distribution.

##### 2. What is/are the insight(s) found from the chart?

The bar plots compare title counts by type, age certification, genre, and production country, along with the average IMDB score for each age certification. Drama and Comedy dominate the genre counts, and the USA and India stand out as the highest-producing regions. The average-rating plot points to mature-rated titles as the stronger performers.
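The genre bar plot counts each distinct `genres` string as a single category, so a multi-genre title is counted under its combination rather than under each genre. The sketch below shows one way to count individual genres instead. It assumes `genres` is stored as a stringified Python list such as `"['drama', 'comedy']"`; that format and the `toy` data are assumptions, so check the real column before reusing this.

```python
import ast
import pandas as pd

# Toy stand-in; assumes genres are stored as stringified lists (an assumption).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]"],
})

# Parse each string into a real list, then give every genre its own row
parsed = toy["genres"].apply(ast.literal_eval)
genre_counts = parsed.explode().dropna().value_counts()
print(genre_counts)
```

The same approach would apply to `production_countries` if it uses the same list-like format.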

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help Amazon Prime improve its content strategy by focusing more on mature-rated, highly rated titles and strengthening partnerships in high-producing regions like the USA and India. Genre popularity patterns, especially Drama and Comedy dominance, guide better content acquisition and recommendation strategies. Overall, these insights drive higher engagement, better user satisfaction, and improved business performance.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Bivariate Analysis
# Scatter plot
sns.scatterplot(data=data2, y='imdb_score', x='imdb_votes', color = 'blue')
plt.title('Scatter Plot of imdb_score vs imdb_votes\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I selected the scatter plot because it is the best chart to analyze the relationship between two numerical variables. It helps identify patterns, correlations, and outliers between IMDB Score and IMDB Votes, giving insights into how audience ratings relate to audience engagement.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows a positive relationship between IMDB Score and IMDB Votes—higher-rated titles generally receive more audience engagement. Most high-voted titles fall in the 6–8 score range, while low-scoring titles attract fewer votes. A few outliers indicate underrated or overrated titles.
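A correlation coefficient can put a number on the relationship that the scatter plot suggests. The sketch below computes Pearson correlation and a rank-based (Spearman-style) correlation, obtained here as the Pearson correlation of the ranks. The `toy` values are made up; vote counts are usually heavily skewed, which is why the rank-based figure is shown alongside.

```python
import pandas as pd

# Made-up scores and vote counts; votes grow much faster than scores
toy = pd.DataFrame({
    "imdb_score": [4.0, 5.5, 6.0, 7.2, 8.1],
    "imdb_votes": [120, 900, 1500, 40000, 250000],
})

# Pearson measures linear association
pearson = toy["imdb_score"].corr(toy["imdb_votes"])

# Rank-based correlation: Pearson correlation computed on the ranks
spearman = toy["imdb_score"].rank().corr(toy["imdb_votes"].rank())

print(round(pearson, 3), round(spearman, 3))
```

In the notebook the same two lines could be run on `data2`. Rows where missing votes were filled with 0 may be worth excluding first, since they are placeholders rather than measurements.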

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights positively impact the business by identifying high-engagement, high-quality content and improving recommendation and content-investment strategies. However, outliers showing high votes but low scores indicate that popular content can still perform poorly, which may affect user trust. Also, many low-score titles receiving low engagement suggest the presence of weak content that can negatively affect platform quality.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#line plot of runtime vs release year
plt.figure(figsize=(8,6))
sns.lineplot(data=data2, x='release_year', y='runtime', color='red')
plt.title('Line Plot of Runtime vs Release Year\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I selected the line plot because it is ideal for visualizing how runtime changes over release years. It helps identify trends, fluctuations, and long-term patterns over time, making it the best chart to study time-based data such as runtime evolution.

##### 2. What is/are the insight(s) found from the chart?

The line plot shows that runtime changes significantly across release years. Older movies tend to have higher runtimes, while modern titles show more variation and often shorter durations. The fluctuations and spikes indicate changing audience preferences and diverse content formats across years.
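When several rows share the same x value, `sns.lineplot` by default draws the mean of y with a confidence band, so the line above is a per-year average rather than individual titles. The sketch below reproduces that aggregation explicitly on made-up data, and adds the number of titles per year, since averages from years with few titles are noisy.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "release_year": [1990, 1990, 2005, 2005, 2005, 2020],
    "runtime": [120, 100, 95, 105, 100, 45],
})

# Mean runtime and number of titles for each release year
yearly = toy.groupby("release_year")["runtime"].agg(["mean", "count"])
print(yearly)
```

Running the same `groupby` on `data2` would show which years rest on only a handful of titles, which helps when reading spikes in the line plot.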

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help Amazon Prime by showing runtime trends that can guide content acquisition, viewer targeting, and production planning, leading to improved user engagement. However, declining runtimes or extreme fluctuations may indicate inconsistent content length or lower content depth, which can negatively affect user satisfaction and long-term platform growth.

#### Chart - 7

In [None]:
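# Chart - 7 visualization code
# Line plot of seasons vs release year
# Missing seasons were filled with 0 during cleaning, so keep only rows with at least one season
shows = data2[data2['seasons'] > 0]
plt.figure(figsize=(8,6))
sns.lineplot(data=shows, x='release_year', y='seasons', color='green')
plt.title('Line Plot of Seasons vs Release Year\n', color='brown')
plt.show()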

##### 1. Why did you pick the specific chart?

I chose the line plot because it clearly shows how the number of seasons changes over different release years. It is the best chart to observe trends over time and identify patterns in TV show season counts.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that older shows generally have more seasons, while recent shows tend to have fewer seasons. There are also fluctuations and occasional spikes, indicating changes in production style and viewer preferences over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help guide content strategy by showing the trend toward shorter, modern series, which can improve viewer engagement. However, fewer long-running shows may reduce long-term user retention, which could negatively affect platform loyalty.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# line plot imdb_score vs release_year
plt.figure(figsize=(8,6))
sns.lineplot(data=data2, x='release_year', y='imdb_score', color='orange')
plt.title('Line Plot of imdb_score vs release_year\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the line plot because it clearly shows how IMDB scores change over different release years. It is ideal for identifying long-term trends and patterns over time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that IMDB scores fluctuate across release years, with no constant upward or downward pattern. Some years have higher-rated movies, while others show dips, indicating inconsistent content quality over time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help understand which years produced higher-quality content, guiding better content acquisition and recommendation decisions. However, inconsistent IMDB score trends may indicate uneven content quality, which could reduce user satisfaction and lead to negative growth if not addressed.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#line plot of tmdb_score vs release_year
plt.figure(figsize=(8,6))
sns.lineplot(data=data2, x='release_year', y='tmdb_score', color='purple')
plt.title('Line Plot of tmdb_score vs release_year\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the line plot because it clearly shows how TMDB scores change over the release years. It is the best chart to observe score trends and patterns across time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that TMDB scores fluctuate across different years, with some years having higher average scores and others showing dips. This indicates that content quality varies over time rather than following a consistent trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help identify which years produced stronger content, supporting better content selection and recommendations. However, inconsistent TMDB scores across years may signal uneven content quality, which can reduce user trust and negatively affect viewer satisfaction.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Pie Chart
# pie chart of type (Movie or TV show )
plt.figure(figsize=(8,6))
plt.pie(data2['type'].value_counts(), labels=data2['type'].value_counts().index, autopct='%1.2f%%')
plt.title('Pie Chart of Type (Movie or TV Show)', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the pie chart because it clearly shows the proportion of Movies vs TV Shows in the dataset. It is the best chart for comparing percentage share of categories in a visual and easy-to-understand way.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that Movies form the majority of the content on the platform, while TV Shows make up a smaller percentage. This indicates that Amazon Prime focuses more on movie content than series.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help the platform understand its content balance and decide whether to invest more in TV Shows or maintain its movie-focused strategy. However, a low proportion of TV Shows may lead to negative growth if users prefer episodic content, reducing engagement and long-term retention.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Pie chart of age certification
a = data2.age_certification.value_counts()
print(a)

In [None]:
# pie chart of age certification
import plotly.express as px
# Pass the counts directly; they are aggregated values, not columns of data2
fig = px.pie(values=a.values, names=a.index, title='Pie Chart of Age Certification')
fig.show()

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# A Violin plot
# Violin plot of imdb_score by type
plt.figure(figsize=(10,6))
sns.violinplot(data=data2, x='type', y='imdb_score')
plt.title('Violin Plot of imdb_score by type\n', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the violin plot because it shows both the distribution and density of IMDB scores for each content type. It provides more detailed insight than a box plot by showing how scores are spread and where most values are concentrated.

##### 2. What is/are the insight(s) found from the chart?

The violin plot shows that both Movies and TV Shows have similar score ranges, but movies usually have a wider spread and more density around mid-level scores. TV shows may show higher density around slightly higher scores, indicating more consistent ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help identify which type (Movies or TV Shows) maintains stronger audience ratings, guiding content investment and recommendations. However, if movies show inconsistent or lower ratings, it may negatively impact user satisfaction and require improving movie quality.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# HEATMAP
num_cols = data2.select_dtypes(include=['int64','float64'])

In [None]:
#heatmap
plt.figure(figsize=(10,8))
sns.heatmap(num_cols.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the correlation heatmap because it visually shows how strongly different numerical variables are related. It helps identify positive/negative correlations quickly using color intensity.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows which variables are strongly or weakly correlated. Some features may show mild positive correlation, while others have almost no relationship. It also highlights variables that do not influence each other at all.
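To turn the heatmap into concrete statements, the correlation matrix can be flattened into a ranked list of column pairs. This is a minimal sketch on made-up numbers; the column names mirror those used in this notebook, but the values and the resulting ranking are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns for illustration
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 6.5],
    "tmdb_score": [5.2, 6.1, 6.8, 7.9, 6.4],
    "runtime": [90, 45, 120, 30, 100],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Using `num_cols.corr()` in place of `toy.corr()` would list the strongest and weakest relationships in the real data by name.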

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Multivariate Analysis
# Pair plot of the numeric columns, colored by type
sns.pairplot(data2, hue='type')
plt.show()

##### 1. Why did you pick the specific chart?

I chose the pair plot because it allows multiple numerical variables to be visualized together. It helps identify relationships, correlations, and patterns between multiple features at the same time.


##### 2. What is/are the insight(s) found from the chart?

The pair plot shows how different numerical variables relate to each other. It reveals which variables have positive or negative correlation and highlights clusters, trends, or outliers across multiple features.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The client should use the insights from EDA to focus more on popular genres and countries, improve recommendation systems, and optimize their content acquisition strategy. By fixing data quality issues and analyzing user trends, the platform can improve engagement, attract more viewers, and make better business decisions.

# **Conclusion**

The EDA of the Amazon Prime dataset helped us understand the distribution of movies and TV shows, popular genres, release trends, and missing data issues. After cleaning the dataset and analyzing key features, we gained clear insights into the platform’s content structure. Overall, the project shows that EDA is essential for revealing patterns, improving data quality, and supporting better decision-making for content strategy.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***