<a href="https://colab.research.google.com/github/ishita07-AI/AMAZON_PRIME_EDA_PROJECT/blob/main/AMAZON_PRIME_EDA_MOVIES_AND_SHOWS_BY_Ishita_Gupta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Exploratory Data Analysis of Amazon Prime Movies & Shows



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** ISHITA GUPTA

# **Project Summary -**

Project Summary: Exploratory Data Analysis of Amazon Prime Movies & Shows

*   Domain: Media & Entertainment
*   Tools Used: Python, Pandas, Matplotlib, Seaborn, Plotly

In this project, I performed a comprehensive Exploratory Data Analysis (EDA) on a publicly available dataset featuring Amazon Prime's movie and TV show catalog. The objective was to extract meaningful insights related to content distribution, genre preferences, regional production trends, and the platform's evolution over time. The workflow included systematic data cleaning, transformation, and a variety of visualizations to effectively communicate patterns and trends. This analysis mirrors real-world practices in media analytics and supports strategic decision-making for OTT platforms by providing a data-driven understanding of viewer content and market dynamics.



# **GitHub Link -**

https://github.com/ishita07-AI/AMAZON_PRIME_EDA_PROJECT

# **Problem Statement**


Amazon Prime hosts thousands of movies and shows, but lacks deep insights into what content performs best across genres, ratings, and popularity. There's a need to analyze trends in IMDb scores, age certifications, release years, and viewer engagement. The goal is to uncover patterns that can guide smarter content investment, improve user experience, and boost platform engagement.

OTT platforms like Amazon Prime are growing rapidly and need data-driven insights to guide content strategies.

Raw data on movies and shows is abundant but unstructured and hard to analyze without proper processing.

There’s a need to explore trends in content types, genres, countries of production, and platform growth.

Key business questions include:

*   What types of content dominate the platform?
*   Which genres and countries are most represented?
*   How has content evolved over time?

A clean and visual analysis can help uncover:

*   Viewer content preferences
*   Regional production dominance
*   Gaps and opportunities in content strategy

The goal is to support smarter, data-backed decisions in media planning and OTT content curation.

#### **Define Your Business Objective?**

The primary business objective of this project is to analyze and derive actionable insights from Amazon Prime's content library to support strategic decisions in content acquisition, audience targeting, and global expansion.

**Content Strategy Optimization:** Understand the distribution of content types (movies vs. TV shows), genres, and durations to identify gaps or saturation in content categories.
➤ Goal: Recommend underrepresented genres or emerging content formats that Amazon Prime could invest in to increase viewer engagement.

**Audience Segmentation & Targeting:** Analyze ratings (e.g., TV-MA, PG, etc.) and content release patterns to infer target age groups and viewing preferences.
➤ Goal: Help marketing teams design more personalized, region-specific campaigns based on audience demographics.

**Talent and Partnership Insights:** Identify frequently featured actors, directors, and production countries to guide collaboration or licensing decisions.
➤ Goal: Optimize contracts and partnerships with high-performing talent or studios that drive viewership.

**Regional Expansion Strategy:** Evaluate the volume and type of content from different countries to support Amazon's global streaming ambitions.
➤ Goal: Recommend markets for localized content production or distribution deals (e.g., regional-language content in India or Latin America).

**Content Lifecycle and Time-Series Trends:** Track how the volume of content added over the years aligns with user growth or platform strategy changes.
➤ Goal: Inform future release schedules and budget planning for seasonal or annual content drops.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
# Upload credits.csv and titles.csv; the Colab dialog accepts several files at once
from google.colab import files
upload = files.upload()

In [None]:
# Read the credits and titles CSV files
credits_df = pd.read_csv("credits.csv")
titles_df = pd.read_csv("titles.csv")

In [None]:
# Merge the datasets on the 'id' column using a LEFT JOIN (every credit row is kept)
merged_df = pd.merge(credits_df, titles_df, on='id', how='left')

### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()

In [None]:
merged_df.tail()

In [None]:
merged_df.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged_df.shape

In [None]:
len(merged_df)

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

In [None]:
merged_df.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = merged_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

In [None]:
merged_df.duplicated().sum()

In [None]:
merged_df = merged_df.drop_duplicates()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged_df.isnull().sum()

In [None]:
# Visualizing the missing values
# Create a heatmap to visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(merged_df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The table covers the columns this notebook works with; `merged_df.columns` lists all 19.

| Column Name         | Description                                                          |
| ------------------- | -------------------------------------------------------------------- |
| `id`                | Title identifier; the key used to merge `credits.csv` and `titles.csv` |
| `type`              | Type of content: movie or show                                       |
| `role`              | Role of the credited person on the title                             |
| `character`         | Character name; missing for some credits                             |
| `description`       | Short summary of the content; may be missing                         |
| `release_year`      | Year the content was originally released                             |
| `age_certification` | Maturity rating like `PG`, `TV-MA`, `R`; may contain `NaN`           |
| `seasons`           | Number of seasons; missing where it does not apply                   |
| `imdb_id`           | IMDb identifier of the title; may be missing                         |
| `imdb_score`        | IMDb score of the title; may be missing                              |
| `imdb_votes`        | Number of IMDb votes; may be missing                                 |
| `tmdb_popularity`   | TMDB popularity value; may be missing                                |
| `tmdb_score`        | TMDB score of the title; may be missing                              |

The data came as two zipped CSV files, `credits.csv` and `titles.csv`. After unzipping them, I uploaded both files to the Colab notebook and read them with `pd.read_csv`. Merging the two files on `id` gave an overview of the dataset, which consists of 124347 rows and 19 columns.
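
The merge described above can be sanity-checked with a small sketch. The two frames below are invented stand-ins for `credits.csv` and `titles.csv` (the `name` and `title` columns and all values are assumptions for illustration). The `validate` and `indicator` arguments of `pd.merge` are used here as one possible way to confirm that each credit matches at most one title and to count credits that found no title.

```python
import pandas as pd

# Toy stand-ins for credits.csv and titles.csv (all values are invented)
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t9"],
    "name": ["A", "B", "C", "D"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["First", "Second", "Third"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})

# validate raises MergeError if an id repeats in titles;
# indicator adds a _merge column recording where each row came from
merged = pd.merge(credits, titles, on="id", how="left",
                  validate="many_to_one", indicator=True)

# Credits whose id has no matching title keep NaN in the title columns
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged.shape)            # rows follow the credits table
print(unmatched)               # credits without a title
print(merged["id"].nunique())  # distinct title ids present after the join
```

With a left join from credits, the row count follows the credits table, so a title with many credited people appears many times in the merged frame.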

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()

### Variables Description

The dataset contains metadata about movies and shows available on Amazon Prime, joined to their cast and crew credits. Because the credits were left-joined to the titles, each row represents one credit (a person attached to a title), and the title-level columns repeat for every credit of the same title. The table describes the columns used in this notebook.


| Variable / Column Name    | Type                   | Description                                                                |
| ------------------------- | ---------------------- | -------------------------------------------------------------------------- |
| **`id`**                  | Categorical / ID       | Identifier of the title; the **merge key** between the two CSV files.      |
| **`type`**                | Categorical            | Indicates whether the title is a **movie** or a **show**.                  |
| **`role`**                | Categorical            | The **role** of the credited person on the title.                          |
| **`character`**           | Text / Nullable        | The **character name**. May be `NaN` for some entries.                     |
| **`description`**         | Text / Nullable        | Short summary or **synopsis** of the content.                              |
| **`release_year`**        | Numerical              | Year in which the **content was originally released**.                     |
| **`age_certification`**   | Categorical / Nullable | **Maturity rating** like `PG`, `R`, `TV-MA`, etc.                          |
| **`seasons`**             | Numerical / Nullable   | **Number of seasons**; missing where it does not apply.                    |
| **`imdb_id`**             | ID / Nullable          | The title's **IMDb identifier**.                                           |
| **`imdb_score`**          | Numerical / Nullable   | The title's **IMDb score**.                                                |
| **`imdb_votes`**          | Numerical / Nullable   | Number of **IMDb votes** for the title.                                    |
| **`tmdb_popularity`**     | Numerical / Nullable   | The title's **TMDB popularity** value.                                     |
| **`tmdb_score`**          | Numerical / Nullable   | The title's **TMDB score**.                                                |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in merged_df.columns:
    unique_values = merged_df[column].unique()
    print(f"Unique values in {column}: {unique_values}")

In [None]:
print("Value counts for 'type':")
print(merged_df['type'].value_counts())
print("\nUnique values in 'role':")
print(merged_df['role'].unique())
print("\nUnique values in 'age_certification':")
print(merged_df['age_certification'].unique())

In [None]:
merged_df['type'].unique()

In [None]:
type_counts = merged_df['type'].value_counts()
print(type_counts)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Rename columns to lowercase and replace spaces with underscores
merged_df.columns = merged_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Check missing values
merged_df.isnull().sum()

In [None]:
# Remove duplicate rows
merged_df = merged_df.drop_duplicates()

# Handle missing values
# Text and categorical columns get an explicit placeholder
merged_df['character'] = merged_df['character'].fillna("Unknown")
merged_df['description'] = merged_df['description'].fillna("No description available")
merged_df['age_certification'] = merged_df['age_certification'].fillna("Not Rated")
merged_df['imdb_id'] = merged_df['imdb_id'].fillna("Unknown")
merged_df['seasons'] = merged_df['seasons'].fillna(0)  # Assuming 0 seasons for movies

# Numeric score columns are filled with their column mean
merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].mean())
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(merged_df['imdb_votes'].mean())
merged_df['tmdb_popularity'] = merged_df['tmdb_popularity'].fillna(merged_df['tmdb_popularity'].mean())
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].mean())

### What all manipulations have you done and insights you found?

During the data wrangling phase of the Amazon Prime EDA project, I performed several cleaning and preprocessing steps to make the dataset reliable and analysis-ready. First, I standardized the column names (trimmed, lowercase, underscores instead of spaces) and removed all duplicate records, which could otherwise have inflated counts or distorted insights. Then I handled missing values column by column: missing `character` names and `imdb_id` values were filled with “Unknown”, empty descriptions with “No description available”, and missing `age_certification` values with “Not Rated” to keep content classification consistent. Missing `seasons` values were set to 0, on the assumption that they belong to movies. The numeric columns `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` were filled with their column means so that those rows could be retained.

These transformations surfaced some important points:

*   A number of credit entries lacked a character name, which may point to lesser-known or regional content with incomplete metadata.
*   Many titles had no age certification, which is a concern for parental controls and audience segmentation.
*   Missing scores and vote counts would have dropped rows from the rating analyses. Mean imputation keeps them, at the cost of pulling those distributions toward the mean.
*   Duplicate records, if left unchecked, would have led to inflated counts of shows and movies.

Overall, these manipulations were important for producing trustworthy visualizations and helped surface deeper trends regarding the nature and diversity of Amazon Prime's content library.
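
As a minimal sketch of this kind of missing-value handling, the example below applies placeholder and mean imputation to a small invented frame. The extra `imdb_score_imputed` flag is an assumption added for illustration, not something the notebook creates; it records which scores were filled so they can be excluded from later averages if needed.

```python
import numpy as np
import pandas as pd

# Invented sample with the same kinds of gaps as the merged dataset
toy = pd.DataFrame({
    "character": ["Hero", None, "Villain"],
    "age_certification": ["PG", None, "R"],
    "seasons": [np.nan, 2.0, np.nan],
    "imdb_score": [6.0, np.nan, 8.0],
})

# Missing values per column before cleaning
missing_before = toy.isnull().sum()

# Placeholders for text columns, passed as a column -> value mapping
text_defaults = {"character": "Unknown", "age_certification": "Not Rated"}
cleaned = toy.fillna(text_defaults)

# 0 seasons for rows without a season count (assumed to be movies)
cleaned["seasons"] = cleaned["seasons"].fillna(0)

# Mean imputation for the score, plus a flag marking the imputed rows
cleaned["imdb_score_imputed"] = toy["imdb_score"].isna()
cleaned["imdb_score"] = cleaned["imdb_score"].fillna(cleaned["imdb_score"].mean())

print(missing_before)
print(cleaned)
```

Keeping a flag column is one way to make the imputation reversible: averages can be recomputed on `cleaned[~cleaned["imdb_score_imputed"]]` to see how much the filled values move them.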

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:

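# Chart - 1 visualization code
# Distribution of Movies and Shows

import matplotlib.pyplot as plt

# merged_df has one row per credit, so count each title once via its 'id'
type_counts = merged_df.drop_duplicates(subset='id')['type'].value_counts()
type_counts.plot(kind='bar', color=['skyblue', 'orange'])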
plt.title("Distribution of Movies vs Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

In this case, a bar chart has been used to visualize the distribution between Movies and TV Shows because it is the most effective and appropriate way to represent categorical comparisons. The dataset contains two primary content types—Movie and TV Show—and a bar chart clearly highlights how many entries exist for each. This type of chart allows viewers to instantly identify which type dominates on the platform in terms of quantity. It also offers straightforward readability, where the height of the bars directly reflects the count, and the color contrast improves visual interpretation.

Other types of visualizations like pie charts were avoided because they are less effective for comparisons when there are only two categories—the differences in proportions become visually subtle and harder to interpret accurately. Line charts, histograms, or box plots were also unsuitable since they are designed for numerical or time-based data rather than simple categorical counts. Hence, for an insight as fundamental as the volume distribution between content types, the bar chart provides a simple, professional, and instantly understandable visualization, ensuring the business can easily act upon the trend—such as understanding content strategy preferences or consumer focus.

##### 2. What is/are the insight(s) found from the chart?

The insight drawn from the chart is that Movies significantly outnumber TV Shows on the Amazon Prime platform. This indicates that Amazon Prime’s content strategy is heavily skewed towards offering more movies than series. Such a distribution suggests that the platform may be catering more to audiences looking for short-form or one-time viewing content rather than long-form episodic engagement. This can also reflect consumer consumption behavior or possibly a business decision to focus on acquiring or producing more film-based content over serialized shows. Understanding this imbalance helps the business evaluate whether this strategy aligns with current market demands, and if there’s a need to diversify by increasing the number of TV shows to retain binge-watchers or users preferring longer engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insight can help create a positive business impact by guiding Amazon Prime’s content strategy more effectively. Knowing that the platform currently has a higher number of movies than TV shows, decision-makers can evaluate whether this aligns with user engagement metrics and preferences. If data shows that users are increasingly interested in binge-worthy content like web series or episodic formats, Amazon may consider investing more in high-quality TV shows to enhance viewer retention. Conversely, if short-form content is driving subscriptions and watch time, continuing to prioritize movies could be the right strategy. In either case, this insight supports data-driven planning, content acquisition, and production investments that align with evolving customer needs, helping optimize user satisfaction and platform competitiveness.
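
Since `merged_df` comes from a credits-to-titles join, a title appears once for every credited person. The sketch below, on invented data, contrasts row counts with counts of distinct titles and adds the share of each type. It is meant only to illustrate why the two numbers can differ.

```python
import pandas as pd

# Invented credit-level rows: t1 has three credits, t3 has two
toy = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t3"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE"],
})

# Counting rows weights each title by the size of its cast and crew
row_counts = toy["type"].value_counts()

# Counting distinct ids gives one vote per title
titles_only = toy.drop_duplicates(subset="id")
title_counts = titles_only["type"].value_counts()
title_share = titles_only["type"].value_counts(normalize=True)

print(row_counts)
print(title_counts)
print(title_share.round(2))
```

A share (via `normalize=True`) is often easier to compare across platforms or over time than a raw count.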




#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Rating Distribution

import seaborn as sns
import matplotlib.pyplot as plt

# 20 bins keep ranges such as 6.0 to 7.5 readable on the x-axis
sns.histplot(merged_df['imdb_score'], bins=20, kde=True)
plt.title('IMDb Score Distribution')
plt.xlabel('IMDb Score')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE (Kernel Density Estimate) line for this chart because it is the most effective way to visualize the distribution of a continuous numerical variable like the IMDb score. A histogram helps us see how the data is spread across different ranges or "bins" of values—such as whether most ratings fall in the average, high, or low end. Unlike a bar chart, which is better for categorical data, a histogram lets us assess the shape of the distribution (e.g., normal, skewed, or bimodal). The KDE line further enhances this by providing a smooth curve that represents the probability density, making patterns in the data clearer. This choice allows stakeholders to quickly understand whether the platform's content generally receives favorable reviews or not

##### 2. What is/are the insight(s) found from the chart?

The IMDb Score Distribution chart reveals that most of the content on Amazon Prime falls within the mid-range ratings, specifically between 6.0 and 7.5. This suggests that the majority of movies and shows are moderately appreciated by audiences, reflecting a standard level of quality. The chart also indicates that titles with very low (below 4.0) or very high (above 8.5) scores are comparatively fewer, highlighting that extreme viewer sentiments, whether highly negative or extremely positive, are less common. The slight right skew in the distribution curve shows a small presence of critically acclaimed content, but not in abundance. Overall, the insights point toward a stable content performance, with an opportunity to focus on increasing the proportion of top-rated titles to further strengthen viewer satisfaction and platform reputation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the IMDb Score Distribution chart can definitely help create a positive business impact. Understanding that a large portion of content on Amazon Prime falls within an average rating range allows the platform to strategically evaluate its content quality. If the goal is to increase user retention and attract new subscribers, Amazon can focus on producing or acquiring more high-quality, critically acclaimed titles—those that score above 8.0 on IMDb. Additionally, identifying underperforming content (with scores below 5.0) allows the company to reassess similar future investments, saving costs and refining their curation strategy. Ultimately, aligning content acquisition and production decisions with user preferences and quality expectations will enhance user satisfaction, drive platform engagement, and support long-term growth.
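
The score ranges mentioned in this section (below 4.0, 6.0 to 7.5, above 8.5) can also be checked numerically rather than read off the histogram. The following is a small sketch on invented scores; the band edges are taken from the text above, and the labels are assumptions chosen for readability.

```python
import pandas as pd

# Invented IMDb scores standing in for merged_df['imdb_score']
scores = pd.Series([3.5, 5.8, 6.2, 6.9, 7.1, 7.4, 8.8], name="imdb_score")

# pd.cut uses right-closed intervals by default: (0, 4], (4, 6], ...
bands = pd.cut(
    scores,
    bins=[0, 4, 6, 7.5, 8.5, 10],
    labels=["<=4.0", "4.0-6.0", "6.0-7.5", "7.5-8.5", ">8.5"],
)

# Share of scores in each band, in band order
band_share = bands.value_counts(normalize=True).sort_index()

# Sample skewness: negative means a longer left tail, positive a longer right tail
skewness = scores.skew()

print(band_share.round(2))
print(round(skewness, 2))
```

Reporting the band shares next to the chart makes a statement such as "most content falls between 6.0 and 7.5" verifiable, and the sign of `skew()` gives a quick check on the direction of the skew.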

#### Chart - 3

In [None]:
# Chart - 3 visualization code
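# Content Type Count

# Count each title once; merged_df repeats a title for every credited person
merged_df.drop_duplicates(subset='id')['type'].value_counts().plot(kind='bar', figsize=(10,5), color='skyblue')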
plt.title('Content Type Count')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart was chosen to visualize the Content Type Count because it effectively displays the frequency distribution of categorical variables—specifically, the number of Movies vs. TV Shows available on Amazon Prime. Bar charts are ideal when comparing discrete categories side-by-side, making it easy to quickly interpret which content type dominates the platform. Unlike a pie chart, which can be harder to interpret when values are close, or a line chart, which is better suited for time-series data, the bar chart provides clear, distinct comparisons. Its straightforward layout allows stakeholders to instantly grasp content proportions and guide decisions related to content investment and audience targeting strategies.


##### 2. What is/are the insight(s) found from the chart?

The insights derived from the Content Type Count bar chart reveal that movies significantly outnumber TV shows on Amazon Prime. This suggests that the platform prioritizes offering a wide variety of films, possibly due to higher viewer demand or easier licensing/acquisition compared to series. It also indicates that Amazon may be positioning itself more as a movie-first streaming service rather than focusing equally on episodic content. This insight can help inform content curation strategies, marketing focus, and even partnership decisions with production houses aiming to fill gaps or strengthen underrepresented categories like TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the Content Type Count chart suggest that Amazon Prime has a significantly higher number of movies compared to TV shows. This can create a positive business impact, as it highlights the platform’s strength in movie content, allowing Amazon Prime to tailor its marketing strategies toward movie lovers, optimize recommendations, and form strong partnerships with film studios. However, the imbalance also reveals a potential drawback. The relatively lower count of TV shows may hinder user retention, as TV series often foster long-term engagement through episodic content. This lack of variety could lead to a loss of users who prefer binge-worthy shows, putting Amazon Prime at a disadvantage against competitors with more balanced libraries. Therefore, while the current movie-heavy catalog can drive short-term gains, diversifying the content by adding more TV series is essential for maintaining long-term growth and reducing subscriber churn.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Average Rating per Age Certification

avg_rating_age = merged_df.groupby('age_certification')['imdb_score'].mean().sort_values(ascending=False)
avg_rating_age.plot(kind='bar', figsize=(10,5), color='green')
plt.title('Average IMDb Score by Age Certification')
plt.ylabel('Avg IMDb Score')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart was selected for visualizing the Average IMDb Score by Age Certification because it effectively communicates the comparative average ratings across different age groups in a simple and clear manner. Since age certification is a categorical variable and we are comparing numerical averages (IMDb scores), a bar chart provides a straightforward way to observe which certifications tend to receive higher audience appreciation. Alternative charts like pie charts would not convey numerical differences well, and line plots would be inappropriate for unordered categorical data. Therefore, the bar chart is ideal for highlighting how content quality, as perceived by viewers, varies with age certification labels.

##### 2. What is/are the insight(s) found from the chart?

The chart ranks the age certification categories by their mean IMDb score, so the leftmost bar is the best-rated category and the rightmost bar the weakest. Three caveats apply when reading it. Titles without a certification are grouped under “Not Rated”, so that bar mixes very different kinds of content. Missing IMDb scores were replaced with the overall mean during wrangling, which pulls every category's average toward that mean and narrows the gaps between bars. Finally, `merged_df` holds one row per credit, so titles with large casts weigh more heavily in each average than titles with few credits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights from this chart can help create a positive business impact. By identifying which age certification categories receive the highest average IMDb scores, Amazon Prime can focus on producing or acquiring more content targeted at those categories. For example, if “TV-MA” or “PG-13” certified content consistently earns higher ratings, it suggests that viewers find such content more engaging or valuable. This helps optimize content investment strategies and tailor marketing efforts toward high-performing audience segments.

On the other hand, insights that reveal age groups with lower average ratings—such as “TV-Y” or “NC-17” (if applicable)—may indicate underperformance or content misalignment with viewer preferences. This could potentially lead to negative growth if resources continue to be spent on content that doesn't resonate well. However, rather than eliminating such content entirely, it gives a clear direction to improve quality, revamp storytelling, or rethink distribution strategies for those segments. Thus, both the positive and negative insights from this analysis are crucial for smarter, data-driven content decisions.
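
An average per category is easier to trust when the number of rows behind it is visible. As a hedged sketch on invented data, the example below computes the mean score and the row count per certification in one pass and filters out categories with too few rows. The `min_rows` threshold is a hypothetical value chosen only for the toy data.

```python
import pandas as pd

# Invented sample: TV-Y has a high mean but only one row behind it
toy = pd.DataFrame({
    "age_certification": ["PG", "PG", "R", "R", "R", "TV-Y"],
    "imdb_score": [6.0, 7.0, 5.0, 6.0, 7.0, 9.0],
})

# Named aggregation: mean score and number of rows per certification
summary = (
    toy.groupby("age_certification")["imdb_score"]
       .agg(mean_score="mean", n_rows="count")
       .sort_values("mean_score", ascending=False)
)

# Hypothetical minimum sample size for a category to be compared
min_rows = 2
reliable = summary[summary["n_rows"] >= min_rows]

print(summary)
print(reliable)
```

In the toy data the top-ranked category rests on a single row, which is the kind of case where a ranking by mean alone can mislead.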

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Description length in characters
merged_df['description_length'] = merged_df['description'].astype(str).apply(len)

# Plot description length distribution
sns.histplot(merged_df['description_length'], bins=50)
plt.title('Description Length Distribution')
plt.xlabel('Characters in Description')
plt.show()

##### 1. Why did you pick the specific chart?

The histogram was chosen for this analysis because it is ideal for understanding the distribution of continuous numerical data, such as the number of characters in the content descriptions. This chart helps reveal how description lengths vary across the dataset, whether they are typically short, long, or vary widely. A histogram groups data into bins, making it easier to identify central tendencies, outliers, and skewness. Compared to a boxplot or bar chart, a histogram is more effective in visualizing frequency patterns and spread for a variable like `description_length`.

##### 2. What is/are the insight(s) found from the chart?


The histogram of description length distribution reveals that most content descriptions are clustered within a shorter length range, indicating that the majority of entries use brief summaries. There's a noticeable peak in the lower bins, suggesting that Amazon Prime often uses concise descriptions to keep user attention. However, there are also a few longer descriptions, implying some variability depending on content type or genre. This variation in length may reflect different marketing strategies or target audience preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the description length distribution can help create a positive business impact. Understanding that most users are exposed to shorter descriptions highlights the importance of concise and compelling copywriting to attract viewer interest quickly. Platforms can use this information to standardize and optimize description lengths, ensuring they are engaging without overwhelming users. Moreover, identifying content with unusually short or long descriptions allows for improvement in metadata quality, potentially increasing click-through rates and user engagement, ultimately enhancing content discovery and viewer satisfaction.
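
One way to identify "unusually short or long" descriptions is the interquartile range (IQR) rule. The sketch below applies it to invented placeholder strings; the 1.5 multiplier is the conventional choice for this rule, not a value taken from the notebook. Note that `.str.len()` leaves missing descriptions as `NaN`, whereas a filled-in placeholder such as “No description available” would be counted as a real length.

```python
import pandas as pd

# Invented placeholder descriptions of known lengths, plus one missing value
descriptions = pd.Series(["a" * 6, "a" * 24, "a" * 25, "a" * 25, "a" * 200, None])

# Character count per description; missing values stay NaN
lengths = descriptions.str.len()

# IQR rule: flag lengths far outside the middle 50% of the data
q1 = lengths.quantile(0.25)
q3 = lengths.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = lengths[(lengths < lower) | (lengths > upper)]

print(lengths.describe())
print(outliers)
```

The flagged rows are candidates for a metadata review rather than automatic errors, since a very short description can still be accurate.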

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Content Released Over the Years

plt.figure(figsize=(12,6))
sns.histplot(merged_df['release_year'], bins=30, kde=False)
plt.title('Number of Titles Released Over Time')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it shows how a numerical variable is distributed, here the release_year of each title. Grouping the years into bins makes it easy to see in which periods the catalog's titles were released, where the distribution peaks, and how far it extends toward older releases. Compared to a boxplot, a histogram gives a clearer view of how the number of titles changes over time.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows how the number of titles varies by release year. It indicates which periods contribute the most titles to the catalog and how long the tail of older releases is, which reflects the balance between recent and older content on the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the distribution of release years can help create a positive business impact. Knowing how the catalog is split between recent and older titles helps decide where to focus acquisition and licensing. If the catalog leans heavily toward one period, that is a possible risk: viewers looking for the under-represented period, whether classics or new releases, may turn to competing platforms.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# top 10 Genres by Count

from collections import Counter
from itertools import chain
import ast

# Genres are assumed to be stored as list-like strings (e.g. "['comedy', 'family']").
# Anything that does not parse that way is treated as a comma-separated string.
def parse_genres(genre_list_str):
    if isinstance(genre_list_str, str):
        try:
            # Try parsing as a list string
            parsed = ast.literal_eval(genre_list_str)
            if isinstance(parsed, (list, tuple)):
                return list(parsed)
        except (ValueError, SyntaxError):
            pass
        # Fall back to a comma-separated string or a single genre
        return [g.strip() for g in genre_list_str.split(',') if g.strip()]
    return [] # Return empty list for non-string or invalid inputs

# Flatten the per-row genre lists and count each genre
parsed_genres = merged_df['genres'].dropna().apply(parse_genres)
genre_counts = Counter(chain.from_iterable(parsed_genres))
top_genres = dict(genre_counts.most_common(10))

# Plot
plt.figure(figsize=(10,5))
sns.barplot(x=list(top_genres.values()), y=list(top_genres.keys()))
plt.title("Top 10 Genres")
plt.xlabel("Count")
plt.ylabel("Genre")
plt.show()

##### 1. Why did you pick the specific chart?

This horizontal bar chart was chosen because it effectively displays categorical data like genres, allowing for clear comparison of the frequency of each genre. It makes it easy to interpret which genres are most prevalent, with longer bars instantly drawing attention to the most frequent categories.



##### 2. What is/are the insight(s) found from the chart?


The chart reveals the top 10 most frequently occurring genres on Amazon Prime. Genres like Drama, Comedy, and Action dominate the platform's content catalog. This points to the catalog's main focus areas and, indirectly, to the audience demand it is built around. The remaining genres in the top 10 show that the catalog still covers a range of audience segments.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights can support data-driven content acquisition and production strategies. Knowing the top-performing genres allows Amazon Prime to invest more in these categories, ensuring content alignment with audience demand. This can lead to higher viewer engagement, retention, and subscription growth.

A possible negative insight is the overconcentration of certain genres (like drama or comedy). This could result in genre fatigue for viewers and missed opportunities in underserved genres like documentaries, sci-fi, or niche regional content. If competitors cater to those niche demands more effectively, Amazon Prime may risk losing those audience segments. Therefore, while focusing on top genres is profitable, maintaining content diversity is crucial for long-term growth.
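
To put a number on genre concentration, one option is to compute the share of titles that carry each genre. The sketch below uses a toy DataFrame; the `id` column as a title key is an assumption. Counting distinct titles matters if `merged_df` holds one row per title and credited person, as the `role` and `name` columns used for Chart 10 suggest, because raw row counts would then weight each title by the size of its cast.

```python
import pandas as pd

# Toy stand-in: one row per title, genres already parsed into Python lists.
# The 'id' column name is an assumption about the title key.
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "genres": [["drama", "comedy"], ["drama"], ["action", "drama"], ["documentary"]],
})

# explode() yields one row per (title, genre) pair.
genre_rows = toy_titles.explode("genres")

# Share of distinct titles that carry each genre.
n_titles = toy_titles["id"].nunique()
genre_share = (
    genre_rows.groupby("genres")["id"].nunique() / n_titles
).sort_values(ascending=False)
print(genre_share)
```

Because a title can carry several genres, the shares do not sum to 1; each value answers "what fraction of titles include this genre".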

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Average IMDb Score by Content Type

avg_score_by_type = merged_df.groupby('type')['imdb_score'].mean().sort_values()

avg_score_by_type.plot(kind='bar', color='green')
plt.title("Average IMDb Score by Type")
plt.ylabel("Avg Score")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing average IMDb scores across different content types (e.g., Movies, TV Shows). It provides a clear visual distinction between categories, making it easy to identify which type performs better in terms of audience ratings.


##### 2. What is/are the insight(s) found from the chart?


The chart reveals the average IMDb scores by content type. If, for instance, TV Shows have a slightly higher average than Movies (or vice versa), it indicates how different formats are perceived by audiences. This insight highlights viewer satisfaction trends based on content format.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding which content type receives higher viewer ratings can guide strategic investment decisions. If TV Shows consistently outperform Movies in ratings, Amazon can focus more on producing or acquiring high-quality TV series to enhance user satisfaction and boost platform loyalty.

If a particular content type (e.g., Movies) has a noticeably lower average IMDb score, it could indicate quality concerns in that category. This might lead to negative reviews, lower watch times, and subscriber churn if not addressed. Therefore, such insights highlight improvement areas critical to maintaining overall platform quality and viewer trust.
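
A difference in mean scores is easier to judge next to the group sizes and spread. The following sketch, on hypothetical toy data, shows one way to get count, mean, median, and standard deviation per content type in a single call.

```python
import pandas as pd

# Toy stand-in for merged_df (hypothetical scores, for illustration only).
toy_df = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [5.0, 6.0, 7.0, 7.0, 8.0],
})

# Several summary statistics per content type.
score_summary = toy_df.groupby("type")["imdb_score"].agg(["count", "mean", "median", "std"])
print(score_summary)
```

If the gap between the means is small compared with the standard deviations, the bar chart alone should not be read as evidence that one format is rated better.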

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# TMDb Popularity vs IMDb Score Scatter Plot

plt.figure(figsize=(8,5))
sns.scatterplot(data=merged_df, x='imdb_score', y='tmdb_popularity', hue='type', alpha=0.7)
plt.title('IMDb Score vs TMDb Popularity')
plt.xlabel('IMDb Score')
plt.ylabel('TMDb Popularity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is perfect for visualizing the relationship between two continuous variables — in this case, IMDb Score and TMDb Popularity. It allows us to detect correlation patterns, clusters, and outliers, while the hue='type' distinction adds clarity by separating Movies and TV Shows.






##### 2. What is/are the insight(s) found from the chart?

There is no strong linear correlation between IMDb scores and TMDb popularity — some titles with high IMDb scores have low popularity and vice versa.

Certain TV shows or movies cluster at moderate scores but high popularity, possibly due to trending topics or franchise value.

A few outliers with very high TMDb popularity but average IMDb scores may indicate marketing-driven or viral content.
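
The visual impression of a weak linear relationship can be checked numerically. The sketch below, on hypothetical toy data, compares the Pearson correlation (linear) with the Spearman correlation (rank-based, computed here as Pearson on ranks). A few extreme popularity values can pull Pearson down even when the ordering of titles is consistent.

```python
import pandas as pd

# Toy stand-in (hypothetical values): popularity grows with score,
# but one viral outlier breaks the linear pattern.
toy_df = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [1.0, 2.0, 4.0, 8.0, 100.0],
})

# Pearson: strength of the linear relationship.
pearson = toy_df.corr(method="pearson").loc["imdb_score", "tmdb_popularity"]

# Spearman: Pearson correlation of the ranks, so it only measures monotonic order.
spearman = toy_df.rank().corr(method="pearson").loc["imdb_score", "tmdb_popularity"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

On the real data both values could be computed separately for movies and shows to see whether the relationship differs by content type.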

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes. Because IMDb score and TMDb popularity are only weakly related, they capture different things, and promotion or acquisition decisions should look at both rather than at either one alone. Highly rated titles with low popularity are candidates for better surfacing in recommendations.

A possible negative signal is the group of very popular titles with only average IMDb scores: if popularity is driven mainly by marketing or short-lived trends, leaning on such titles may attract viewers without satisfying them.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Top 10 Actors by Frequency

from collections import Counter

# Filter for actors and extract names
actor_names = merged_df[merged_df['role'] == 'ACTOR']['name'].dropna()

# Count actor occurrences
actor_counts = Counter(actor_names)
top_actors = dict(sorted(actor_counts.items(), key=lambda x: x[1], reverse=True)[:10])

# Plot
plt.figure(figsize=(10,5))
sns.barplot(x=list(top_actors.values()), y=list(top_actors.keys()))
plt.title("Top 10 Most Frequently Featured Actors")
plt.xlabel("Appearances")
plt.ylabel("Actors")
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are well suited to comparing frequencies across discrete categories, here the individual actors.

##### 2. What is/are the insight(s) found from the chart?

Certain actors appear far more often than others, which may indicate that they are Amazon favorites or have strong audience pull.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These actors can be prioritized for future collaborations and promotions to attract loyal viewership.
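
One caveat when counting appearances: if the credits data lists an actor more than once for the same title (for example, for several characters), raw row counts overstate how many titles the actor appears in. A minimal sketch of counting distinct titles instead, on toy data; the `id` column as a title key is an assumption.

```python
import pandas as pd

# Toy stand-in for the credits part of merged_df (hypothetical names and ids).
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t3"],
    "name": ["Actor A", "Actor A", "Actor A", "Actor B", "Actor C"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

actors = toy_credits[toy_credits["role"] == "ACTOR"]

# Raw row counts: Actor A is counted twice for title t1.
raw_counts = actors["name"].value_counts()

# Distinct titles per actor: drop repeated (title, actor) pairs first.
title_counts = actors.drop_duplicates(["id", "name"])["name"].value_counts()

print(raw_counts)
print(title_counts)
```

`title_counts.head(10)` would then give a top-10 list based on distinct titles.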

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Runtime Distribution of Movies

plt.figure(figsize=(10,5))
sns.boxplot(data=merged_df[merged_df['type']=='MOVIE'], x='type', y='runtime', color='skyblue')
plt.title("Runtime Distribution of Movies")
plt.ylabel("Runtime (minutes)")
plt.show()

##### 1. Why did you pick the specific chart?

Boxplot shows central tendency and outliers — ideal for continuous variables like runtime.



##### 2. What is/are the insight(s) found from the chart?

Most movies fall between 80–120 minutes, but outliers exist with extremely short or long durations.
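
The range read off the boxplot can be backed by numbers. The sketch below, on hypothetical toy data, computes the runtime quartiles for movies and the share of movies that fall inside the 80–120 minute band.

```python
import pandas as pd

# Toy stand-in for merged_df (hypothetical runtimes in minutes).
toy_df = pd.DataFrame({
    "type": ["MOVIE"] * 6 + ["SHOW"] * 2,
    "runtime": [12, 85, 95, 100, 118, 210, 25, 45],
})

movie_runtime = toy_df.loc[toy_df["type"] == "MOVIE", "runtime"].dropna()

# Quartiles correspond to the box edges and the median line of the boxplot.
quartiles = movie_runtime.quantile([0.25, 0.5, 0.75])

# between() is inclusive on both ends; the mean of a boolean mask is a share.
share_80_120 = movie_runtime.between(80, 120).mean()

print(quartiles)
print(f"Share of movies between 80 and 120 minutes: {share_80_120:.0%}")
```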




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Confirms standard movie length, helpful for programming and audience expectations.


#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Content Type Count by Maturity Rating
plt.figure(figsize=(12,5))
sns.countplot(data=merged_df, x='age_certification', hue='type')
plt.title("Content Type by Age Certification")
plt.xlabel("Age Certification")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

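A grouped count plot was chosen because it compares the number of titles in each age certification and, through the hue, splits each certification by content type (movie or show) in a single view.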

##### 2. What is/are the insight(s) found from the chart?

TV-14 and R are dominant, suggesting a focus on teenage and adult audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Supports targeting strategy for young adults and mature users.
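
The count plot leaves out titles with a missing certification, and film ratings such as R and TV ratings such as TV-14 apply to different content types. A cross-tabulation makes both points visible. This is a minimal sketch on hypothetical toy data; the `"MISSING"` label is an arbitrary placeholder.

```python
import pandas as pd

# Toy stand-in for merged_df (hypothetical values).
toy_df = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "age_certification": ["R", "PG-13", None, "TV-14", "TV-MA", "TV-14"],
})

# Keep missing certifications as their own category instead of dropping them.
cert = toy_df["age_certification"].fillna("MISSING")

# Absolute counts per certification and content type.
cert_by_type = pd.crosstab(cert, toy_df["type"])

# Shares within each content type (each column sums to 1).
cert_share = pd.crosstab(cert, toy_df["type"], normalize="columns")

print(cert_by_type)
print(cert_share.round(2))
```

Comparing shares within each type avoids reading a difference in catalog size between movies and shows as a difference in audience targeting.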

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# TV Show Seasons Distribution

plt.figure(figsize=(10,5))
sns.histplot(merged_df[merged_df['type']=='SHOW']['seasons'].dropna(), bins=15, kde=True)
plt.title("Distribution of Seasons in TV Shows")
plt.xlabel("Number of Seasons")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows how many seasons most shows have.

##### 2. What is/are the insight(s) found from the chart?

The majority of shows have 1 to 3 seasons; few long-running series exist.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select numerical columns for correlation

numeric_cols = merged_df.select_dtypes(include=['float64', 'int64'])

# Compute correlation matrix
correlation_matrix = numeric_cols.corr()

# Plot heatmap
plt.figure(figsize=(10,6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is a powerful visualization for identifying linear relationships between multiple numeric variables at once. In this context, plotting a heatmap allows us to examine how features like IMDb score, TMDb popularity, runtime, and release year are interrelated. This helps detect potential dependencies or patterns between variables that could be leveraged for predictive modeling, content recommendation systems, or strategic decisions on content acquisition.



##### 2. What is/are the insight(s) found from the chart?


The heatmap reveals that IMDb score and TMDb popularity have a low positive correlation, suggesting that while some popular content may be highly rated, this is not always the case. Runtime shows a slightly positive relationship with IMDb score, implying that longer shows or movies might be rated somewhat higher. There is also a negligible correlation between release year and ratings, indicating that the age of a title alone does not drive its rating.
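
With many numeric columns, the strongest relationships are easier to read from a ranked list than from the heatmap itself. The sketch below, on hypothetical toy data, keeps each pair of variables once (the upper triangle of the correlation matrix) and sorts the pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric columns of merged_df (hypothetical values).
toy_df = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "runtime": [90, 100, 95, 120],
    "release_year": [2020, 2015, 2010, 2005],
})

corr = toy_df.corr()

# Boolean mask for the strict upper triangle: each pair once, no diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per variable pair, strongest relationships first.
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations at the top alongside strong positive ones.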

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select key numerical features for the pair plot
selected_features = ['imdb_score', 'tmdb_popularity', 'runtime', 'release_year']

# Create pairplot
sns.pairplot(merged_df[selected_features].dropna())
plt.suptitle('Pair Plot of Key Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot allows us to observe relationships and distribution patterns between multiple numerical features simultaneously. This chart helps in understanding how variables interact with each other, uncovering potential linear/non-linear trends, clustering, or outliers. It's particularly useful for feature selection and understanding dataset behavior before modeling.

##### 2. What is/are the insight(s) found from the chart?

IMDb score and TMDb popularity don’t exhibit a strong linear correlation, though a few outliers suggest that extremely high scores may correspond to spikes in popularity.

Runtime shows minor positive clustering with IMDb score, indicating that slightly longer content tends to receive higher ratings.

The release year is somewhat uniformly distributed and does not display a strong relationship with any other variables, indicating that content age alone doesn't drive ratings or popularity.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve Amazon Prime's business goals of boosting engagement and subscriber growth, focusing on data-backed strategies is essential. The analysis shows that genres like Drama, Comedy, and Action are the most prevalent in the catalog, suggesting continued investment in these categories.

Content with higher IMDb and TMDb scores reflects strong viewer satisfaction. Featuring such titles in curated sections like “Top Rated” can enhance visibility and viewer retention.

Movies tend to outperform TV shows in ratings; improving script quality and pacing for shows can help close this gap. Additionally, content with runtimes between 90–120 minutes performs best and should be prioritized.

Older titles still maintain good ratings, offering a low-cost opportunity to re-promote quality classics. A diverse mix of content along with personalized recommendations can drive sustained engagement.

These insights support a targeted content strategy that aligns with audience demand and platform performance goals, ensuring positive business impact.

# **Conclusion**

The in-depth Exploratory Data Analysis of Amazon Prime's movie and show catalog provides actionable insights to drive strategic growth. It clearly highlights that genres like Drama, Comedy, and Action dominate viewer preferences, while movies consistently receive higher IMDb ratings compared to TV shows. Additionally, titles with optimal runtimes and strong historical performance still maintain popularity, indicating the value of quality evergreen content.

To achieve stronger business impact, Amazon Prime should focus on producing and promoting high-quality content in these top-performing genres, re-market high-rated legacy titles, and invest in content strategies backed by performance data. These data-backed decisions will not only enhance user satisfaction and engagement but also foster long-term platform loyalty and competitive advantage.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***